Abstract. It has become apparent in recent years that the performance of
current high performance computers, from powerful workstations to massively
parallel processors, is strongly dependent on the behaviour of the memory
hierarchy. In fact, it affects not only the computation time but also the time
consumed in performing communications. In this research, the impact of the
memory  hierarchy  usage  on  the  partitioning  of  multidimensional  regular 
domain  problems  is  studied.  We  use  as  an  example  the  numerical  solution  of  a 
three-dimensional  partial  differential  equation  in  a  regular  mesh,  by  means  of  a 
multigrid-like  iterative  method.  Experimental  results  contradict  the  traditional 
regular  partitioning  techniques  on  some  present  parallel  computers  like  the  Cray 
T3E  or  the  SGI  Origin  2000:  a  linear  decomposition  is  more  efficient  than  a 
three  dimensional  one  due  to  the  better  exploitation  of  the  spatial  data  locality. 
For similar reasons, computation-communication overlapping also increases
execution time.


1.  Introduction 


The  performance  of  current  parallel  computers,  composed  of  up  to  hundreds  of 
superscalar  commodity  microprocessors,  presents  an  increasing  dependence  on  the 
effective  usage  of  their  hierarchical  memory  structures.  Indeed,  the  maximum 
performance  that  can  be  obtained  in  current  microprocessors  is  limited  by  the  memory 
access.  The  peak  performance  of  the  microprocessors  has  increased  by  a  factor  of  4-5 
every  3  years  by  exploiting  the  increasing  integration  density,  reducing  the  clock 
cycle,  and  by  implementing  architectural  techniques  to  take  advantage  of  the  multiple 
levels  of  parallelism.  However,  the  memory  access  time  has  been  reduced  by  a  factor 
of just 1.5-2 over the same period. Thus, the latency of memory access in terms of
processor  performance  grows  by  a  factor  of  2-3  every  three  years.  This  situation 
seems  likely  to  continue  over  the  next  few  years  and  it  has  been  suggested  that  such 
trends may result in a “memory wall” in which application performance is entirely
dominated  by  memory  access  time  [1][2]. 

The common technique to bridge this gap and hide the problem is to use a
hierarchical  memory  structure  with  large  and  fast  cache  memories  close  to  the 
processor.  As  a  result,  the  memory  structure  has  a  strong  impact  on  the  design  and 
development  of  a  code,  and  the  programs  must  exhibit  spatial  and  temporal  locality  to 
make  efficient  use  of  the  cache  memory  and  so  keep  the  processor  busy.  The 
effectiveness  of  data  locality  has  been  well  demonstrated  in  the  LAPACK  project,  and 
major  research  has  just  begun  to  develop  cache-friendly  iterative  methods  [3]  [4]. 
However,  to  the  best  of  the  authors’  knowledge,  the  impact  of  the  memory  hierarchy 
usage  on  the  partitioning  has  not  previously  been  studied. 

In  this  research,  we  have  studied  applications  where  the  main  computational 
portion  of  the  program  belongs  to  a  class  of  kernels  known  as  stencils.  A  stencil  is  a 
matrix  computation  in  which  groups  of  neighbouring  data  elements  are  combined  to 
calculate  a  new  value.  This  type  of  computation  is  common  in  image  processing, 
geometric  modelling  and  solving  partial  differential  equations  by  means  of  finite 
difference  or  finite  volume.  The  simplest  approach  to  parallelizing  these  kinds  of 
regular  applications  distributes  the  data  among  the  processes,  and  each  process  runs 
essentially  the  same  program  on  its  share  of  the  data.  For  three-dimensional 
applications,  decompositions  in  the  x,  y,  and/or  z  dimensions  are  possible. 

During  the  last  decade,  a  d-dimensional  mesh  of  processors  has  been  considered  as 
the  best  partitioning  to  split  a  d-dimensional  regular  domain  because  in  this  way  the 
interconnection  network  is  more  efficiently  exploited  [5][6].  Furthermore, 
communication-computation  overlapping  techniques  are  performed  to  keep  the 
processor  busy  and  so  improve  the  parallel  efficiency.  However,  our  results  show  that 
in modern parallel computers it is more important to make effective use of the local
memory  hierarchy  than  to  reduce  the  overheads  due  to  network  delay  cost.  The 
interconnection systems have also taken advantage of the increasing integration
density offered by integrated circuit processing technology, and their effective
bandwidth and latency are now hundreds of times better than ten years ago.

This  paper  is  organised  as  follows.  In  Section  2  we  describe  the  sample  code  that 
has  been  used  in  our  research.  The  effect  of  spatial  locality  on  message  sending  is 
described  in  Section  3.  Based  on  this  analysis,  the  choice  of  an  optimal  partition  is 
presented  in  Section  4.  The  influence  of  overlapping  computations  with 
communications  is  presented  in  Section  5.  The  paper  ends  with  some  conclusions  to 
guide  the  partitioning  of  regular  applications  in  current  parallel  computers. 


2.  Sample  Code. 

In  this  research,  we  are  only  interested  in  a  qualitative  description  of  the  most 
important  aspects  that  affect  the  performance,  and  that  should  be  considered  for 
making  informed  design  decisions.  As  a  sample  problem,  we  have  studied  the 
numerical  solution  of  a  time-dependent  partial  differential  equation,  the  three- 
dimensional  Bose-Einstein  equation  [7],  in  a  regular  mesh  subject  to  Dirichlet 
boundary conditions. The problem is to describe the evolution of a physical field (a
complex function) given an initial condition. An implicit finite difference method has
been used to carry out the simulation, and the systems of equations are solved by
means of a multigrid-like iterative method [8]. The execution times that we present in
this paper are the result of single time-step simulations using only one multigrid
iteration.

Like  other  regular  applications,  the  parallel  program  execution  is  a  sequence  of 
computation  and  communication  steps.  The  subdomains  of  every  processor  are 
independently  computed  and  then,  a  communication  between  neighbouring  logical 
processors  updates  the  boundaries  of  these  subdomains. 

The  code  used  in  this  study  parallelizes  well  for  a  number  of  reasons.  The 
discretization  is  regular,  and  the  same  operations  are  applied  at  each  grid  point,  even 
though  the  evolution  of  the  system  is  non-linear.  Thus,  the  problem  can  be  statically 
load-balanced  at  the  start  of  the  code. 


3.  Spatial  Locality  Impact  on  Message  Sending. 

Message  sending  between  two  tasks  located  on  different  processors  can  be  divided 
into  three  phases:  two  of  them  are  where  the  processors  interface  with  the 
communication  system  (the  send  and  receive  overhead  phases),  and  a  network  delay 
phase, where the data is transmitted between the physical processors. Details of what
the system does during these phases vary. Typically, however, during the send
overhead phase the message is copied into a system-controlled message buffering
area, and control information is appended to the message. In the same way, on the
receiving process, the message is copied from a system-controlled buffering area into
user-controlled memory (receive overhead is usually larger than send overhead).

In several out-of-date parallel computers, like the Thinking Machines CM-5, the
Parsys  Supernode  1000  or  the  Meiko  CS-2,  the  most  important  component  was  the 
network  delay  [9].  However,  in  current  machines  like  the  Cray  T3E  or  the  SGI  Origin 
2000,  as  the  interconnection  networks  increase  their  bandwidth,  the  send  and  receive 
overheads  are  becoming  important.  The  factors  determining  these  overheads  are 
different  in  each  system,  but  they  are  mainly  due  to  uncached  operation,  misses  and 
synchronisation  instructions,  generally  considered  to  be  infrequent  events  and 
therefore  a  low  priority  for  architectural  optimisations  of  commodity  microprocessors. 
The  use  of  these  components  allows  a  rapidly  increasing  performance  and  excellent 
price  performance,  but  microprocessors  are  designed  for  workstations  and  modestly 
parallel  servers.  A  large-scale  multiprocessor  creates  a  foreign  environment  into 
which they are ill-equipped to fit. For example, the memory interfaces are cache line
based,  making  references  to  single  words  (corresponding  to  strided  or  scatter/gather 
references  in  a  vector  machine)  inefficient  [10].  Therefore,  the  cost  of  communication 
depends  not  only  on  the  amount  of  communication  but  also  on  how  it  is  structured  to 
interact  with  the  architecture  (mainly  the  spatial  data  locality). 


3.1 The Cray T3E Message Passing Performance

The T3E used in this study had 32 DEC Alpha 21164 processors running at 300 MHz at
the beginning of our research, and has recently been upgraded with 450 MHz processors.
Like the T3D, the T3E contains no board-level cache, but the Alpha 21164 has two
levels of caching on-chip: 8 KB first-level instruction and data caches, and a unified,
3-way  associative,  96-Kbyte  write-back  second-level  cache.  The  local  memory  is 
distributed  across  eight  banks,  and  its  bandwidth  is  enhanced  by  a  set  of  hardware 
stream  buffers.  These  buffers,  which  exploit  spatial  locality  alone,  can  take  the  place 
of  a  large  board-level  cache,  which  is  designed  to  exploit  both  spatial  and  temporal 
locality.  Each  node  augments  the  memory  interface  of  the  processor  with  640  (512 
user  and  128  system)  external  registers  (E-registers).  They  serve  as  the  interface  for 
message  sending;  packets  are  transmitted  by  first  assembling  them  in  an  aligned  block 
of  8  E-registers. 

The  processors  are  connected  via  a  3D  torus  with  an  inter-processor 
communication  bandwidth  of  480  Mbytes/sec.  Using  MPI,  however,  the  effective 
bandwidth  is  smaller  due  to  overhead  associated  with  buffering  and  with  deadlock 
detection.  The  library  message  passing  mechanism  uses  the  E-registers  to  implement 
transfers,  directly  from  memory  to  memory.  Data  does  not  cross  the  processor  bus;  it 
flows  from  memory  into  E-registers  and  out  to  memory  again  in  the  receiving 
processor.  E-registers  enhance  performance  when  no  locality  is  available  by  allowing 
the  on-chip  caches  to  be  bypassed.  However,  if  the  data  to  be  loaded  were  in  the  data 
cache,  then  accessing  that  data  via  E-registers  would  be  sub-optimal  because  the 
cache-backmap  would  first  have  to  flush  the  data  from  data  cache  to  memory 
[9][10][11].


Fig.  1.  CRAY  T3E  message  passing  performance  for  contiguous  data.  The  network  distance 
between  the  processors  involved  in  the  communication  varies. 

Figure 1 shows the measured one-way communication bandwidth for different
message  sizes  using  MPI.  The  test  program  uses  all  of  the  28  processors  available  in 
the  system.  There  is  always  the  same  sender  processor  and  one  receiver  processor  that 
varies. The sender initiates an immediate send followed by an immediate receive, then
it  waits  until  both  the  send  and  the  receive  have  been  completed.  The  receiver  begins 
by  starting  an  immediate  receive  operation,  then  waits  until  it  is  finished.  It  replies 
with  another  message  using  a  send/wait  combination.  Because  this  operation  is 
repeated  many  times,  if  all  the  data  fits  into  the  cache  then,  except  for  the  first  echo, 
the required data will be found in the cache. However, on the CRAY T3E, the suppress
directive [12] can be used to invalidate the entire cache, which forces all entries in
the cache to be read from memory. The measurements demonstrate that there is no
difference between close and distant processors in the CRAY T3E.

Figure 2 shows the impact of the spatial data locality. We also use the simple echo
test, but we modify the data locality by means of different strides between successive
elements of the message. The stride is the number of double precision data between
successive elements of the message, so stride-1 represents contiguous data. We use
MPI datatypes (MPI_Type_vector) instead of the MPI_Pack / MPI_Unpack
routines, because they may allow certain performance optimisations. However, we
must be careful because the use of certain MPI datatypes can dramatically slow down
communication performance, e.g., the MPI_Type_hvector type in the T3E
implementation. We send buffers that are 8-byte aligned because the T3E copies
non-aligned data slowly. This is automatic for the usual case of sending double
precision data. Due to memory constraints the largest message is limited to 32 Kbytes.
Although this is not big enough to obtain the asymptotic bandwidth for the stride-1
case, these sizes are similar to the messages used in our application program.
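
The stride definition above can be made concrete with a short C sketch. It is not the benchmark code used for the measurements, and the helper names gather_strided and strided_span are hypothetical. The first function copies count doubles taken every stride elements into a contiguous buffer, which is the memory layout that a vector datatype with block length 1 describes; the second reports how many source elements such a message spans.

```c
#include <stddef.h>

/* Copy `count` doubles, taken every `stride` elements from `src`,
 * into the contiguous buffer `dst`. stride == 1 is contiguous data. */
void gather_strided(const double *src, size_t count, size_t stride,
                    double *dst)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = src[i * stride];
}

/* Number of source elements spanned by the message, from the first
 * element sent to the last one (0 for an empty message). */
size_t strided_span(size_t count, size_t stride)
{
    return count == 0 ? 0 : (count - 1) * stride + 1;
}
```

Under these assumptions, a 32 Kbyte message holds 4096 doubles, and with stride 2048 it spans 4095 x 2048 + 1 = 8386561 source elements, so every element sent comes from a distant memory location.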


Fig. 2. CRAY T3E message passing performance using non-contiguous data.
(Horizontal axis: message size in bytes; one curve per stride, with legend entries
for strides 2, 4, 8, 32, 64, 128, 512, 1024 and 2048.)

It is interesting to note that almost the same effective bandwidth is obtained for
strides between 16 and 1024 double precision data. For 32 KB messages, stride-1
bandwidth is around 5 times better than stride-16. Beyond stride-1024 this difference
grows, with stride-1 being 10 times better than stride-2048.


3.2 SGI Origin 2000 Message Passing Performance

We repeated these tests on an SGI Origin 2000. The Origin is a distributed shared-
memory  system  with  a  hypercube  network  in  which  each  processing  node  contains 
two  processors,  a  portion  of  the  shared  memory,  a  directory  for  cache  coherence,  and 
interfaces  to  I/O  devices  and  other  system  nodes.  The  system  used  in  this  study  has 
the MIPS R10000 running at 195 MHz. Each processor has a 32 Kbyte two-way set-
associative  primary  data  cache  and  a  4-Mbyte  two-way  set-associative  secondary  data 
cache.  One  important  difference  between  this  system  and  the  T3E  is  that  it  caches 
remote  data,  while  the  T3E  does  not.  The  memory  bandwidth  per  node  is  780 
Mbytes/sec.  Latencies  to  the  memory  modules  of  the  Origin  2000  system  depend  on 
the  network  distance  from  the  issuing  processor  to  the  destination  memory  node. 
Accesses  to  local  memory  take  80  clock  cycles  (CC)  (400  ns),  while  latencies  to 
remote nodes are the local memory time plus 22 CC (110 ns) for each network router,
plus  a  one-time  penalty  of  33  CC  for  a  remote  access.  On  a  32-processor  machine,  the 
maximum  distance  covers  4  routers,  so  that  the  longest  memory  access  is  about  201 
CC (1005 ns) [13][14][15].
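
The latency figures quoted above can be checked with a few lines of C. This is only an arithmetic sketch: the function names are hypothetical, and the 5 ns cycle time is an assumption taken from the ratio implied by "80 clock cycles (400 ns)" in the text.

```c
/* Figures quoted in the text for the Origin 2000, in clock cycles. */
enum {
    LOCAL_CC = 80,          /* access to local memory             */
    PER_ROUTER_CC = 22,     /* added per network router crossed   */
    REMOTE_PENALTY_CC = 33  /* one-time penalty for remote access */
};

/* Assumption: 5 ns per cycle, as implied by "80 CC (400 ns)". */
enum { NS_PER_CC = 5 };

/* Latency in clock cycles for an access that crosses `routers`
 * routers; zero (or fewer) routers means a local access. */
int memory_latency_cc(int routers)
{
    if (routers <= 0)
        return LOCAL_CC;
    return LOCAL_CC + REMOTE_PENALTY_CC + PER_ROUTER_CC * routers;
}

/* The same latency expressed in nanoseconds. */
int memory_latency_ns(int routers)
{
    return memory_latency_cc(routers) * NS_PER_CC;
}
```

For the 4-router worst case this gives 80 + 33 + 4 x 22 = 201 CC, i.e. 1005 ns, which agrees with the value in the text.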

However,  as  in  the  CRAY  T3E,  using  MPI,  the  time  required  to  send  a  message 
from  one  processor  to  another  is  almost  independent  of  both  processor  locations.  We 
have  measured  erratic  differences  of  around  7%. 


Fig. 3. SGI Origin message passing performance for contiguous data. The network
distance between the processors involved in the communication varies.
(Horizontal axis: message size.)


It is interesting to note that the measured bandwidth decreases when the message
sizes are larger than the second-level cache (4 MB). Figure 4 shows the impact of the
spatial data locality; the legend on the right is the number of double precision data
between successive elements. To avoid temporal locality effects we build and free the
message in every echo operation.

Fig. 4. SGI Origin 2000 message passing performance using non-contiguous data.

For non-contiguous data, the reduction in the effective bandwidth is even greater
than in the T3E case. For 256 KB messages, stride-1 bandwidth is around 6.3 times
better than stride-2. This difference grows with the stride, reaching 23 times for
stride-256. The memory interface of the Origin is cache line based, making references
to single data more inefficient than in the Cray T3E. Moreover, the current MPI
implementation on the Origin 2000 requires one extra buffer copy.


3.3 Experimental Results in Our Sample Code

Although the communication pattern in our application program is not a one-way
transfer but a message exchange between neighbouring logical processors, the impact
of spatial locality is noticeable there as well. In this data exchange, advantage can
be taken of bi-directional links, and a greater bandwidth can be obtained than is
possible with the echo test. The code was written in C, so a three-dimensional domain
is stored in a row-ordered (x,y,z)-array. It can be distributed across a 1D mesh of
processors following three possible partitionings: x-direction, y-direction and
z-direction. The x- and y-direction partitionings were found to be more efficient,
because the message data exhibits better spatial locality. X and Y boundaries are
stride-1 data, except for the strides between different Z-columns (two complex data,
i.e. four doubles, for X-partitioning, and this quantity plus two times the number of
elements in an x-plane for Y-partitioning). A message using Z-partitioning has a
stride of 2 times the number of elements in dimension z (all the elements are double
precision complex data). Figures 5 and 6 show the experimental results from the CRAY
T3E and the SGI Origin 2000 respectively. Due to its main memory capacity, the SGI
allows larger simulations.
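
The layout argument can be illustrated with a small C sketch. It is a simplified model rather than the application code: the names idx and boundary_run are hypothetical, lengths are counted in array elements (each complex element is two doubles), and the ghost cells that produce the small gaps between Z-columns mentioned above are ignored.

```c
#include <stddef.h>

/* Offset of element (x, y, z) in a row-major nx*ny*nz array:
 * z varies fastest, then y, then x. */
size_t idx(size_t x, size_t y, size_t z, size_t ny, size_t nz)
{
    return (x * ny + y) * nz + z;
}

/* Length, in elements, of the longest contiguous run inside the
 * boundary plane exchanged when partitioning along `dir`
 * ('x', 'y' or 'z'). Ghost cells are ignored (assumption). */
size_t boundary_run(char dir, size_t ny, size_t nz)
{
    switch (dir) {
    case 'x': return ny * nz; /* whole plane x = const is contiguous */
    case 'y': return nz;      /* one z-column per value of x         */
    case 'z': return 1;       /* isolated elements, nz apart         */
    default:  return 0;       /* unknown direction                   */
    }
}
```

In this model an X boundary is a single contiguous block, a Y boundary is a sequence of contiguous z-columns, and a Z boundary consists of single elements separated by nz elements, which is the worst case for a cache-line-based memory interface.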

X-partitioning is found to be 2 times better than Z-partitioning for the 128-element
simulation on the two different configurations of the CRAY T3E. Although
message-passing bandwidth is very important, we should also note that this difference
is not only a message passing effect. X and Y-partitioning exploit the stream buffers
more efficiently because they maximise inner loop iterations [11]. By means of the MPP
Apprentice performance tool we have found that the time spent in the initiation of
message sending is 5 times larger in the Z-partitioning simulations. This fits in
with what we measured in the echo test.

Fig. 5. Different linear partitionings of our sample application using sixteen
processors in the CRAY T3E. The problem size is the number of cells in each dimension
for the finest grid in the multigrid hierarchy.

Fig. 6. Different linear partitionings of our sample application using 32 processors
in the SGI Origin 2000. The problem size is the number of cells in each dimension for
the finest grid in the multigrid hierarchy. (Legend: X partitioning, Y partitioning,
Z partitioning.)

Equivalent differences in the Origin 2000 are important, but lower than the T3E
ones. For the 128-element problem, X partitioning is only 1.2 times better. For the
256 one, it grows to 1.4. The large second-level cache of this system, which allows
better exploitation of the temporal locality, influences these results [16].

Using 2D and 3D decompositions, we notice the same effects. Z-plane boundaries
slow down the performance of the application because they are discontinuous in
memory. Therefore, as Figure 7 shows, a 2D decomposition using a 4x4x1 array of
abstract processors (4 processors in the x and y dimensions and no decomposition in
the z direction) is better than 4x1x4 and 1x4x4 topologies (the differences are around
15% in the Cray T3E). In the same way, a 3D decomposition using a 4x2x2 array is
better than a 2x2x4 one.

Fig. 7. Different 2D decompositions of our sample application using 16 processors in the
CRAY  T3E.  The  problem  size  is  the  number  of  cells  in  each  dimension  for  the  finest  grid  in  the 
multigrid  hierarchy. 


4.  Partitioning  for  Performance 

Over the last decade, partitioning has focused on reducing the communications
that are inherent to the parallel program. As is well known, for a d-dimensional
problem, the communication requirements for a process grow proportionally to the
size of the boundaries, while computations grow proportionally to the size of its
entire partition. The communication-to-computation ratio is thus a perimeter-to-area
ratio in a two-dimensional problem and, similarly, a surface-area-to-volume ratio in
three dimensions. So, the three-dimensional decomposition leads to a lower inherent
communication-to-computation ratio.
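
The surface-to-volume argument can be made concrete with a short C sketch. It is an idealised count, not a measurement: the function names are hypothetical, the grid is assumed to be a cube of n cells per side that divides evenly among the processes, boundaries are one cell deep, and the process considered has two neighbours in every partitioned direction.

```c
/* Cells exchanged per step by an interior process when an n*n*n grid
 * is split over a px*py*pz process array (one-cell-deep boundaries,
 * two neighbours in every partitioned direction). */
long boundary_cells(long n, long px, long py, long pz)
{
    long sx = n / px, sy = n / py, sz = n / pz; /* local extents */
    long cells = 0;
    if (px > 1) cells += 2 * sy * sz; /* two faces normal to x */
    if (py > 1) cells += 2 * sx * sz; /* two faces normal to y */
    if (pz > 1) cells += 2 * sx * sy; /* two faces normal to z */
    return cells;
}

/* Cells updated per step by each process. */
long local_cells(long n, long px, long py, long pz)
{
    return (n / px) * (n / py) * (n / pz);
}
```

For a 128-cell grid on 8 processes, this count gives 32768 boundary cells for an 8x1x1 array and 24576 for a 2x2x2 array, with 262144 local cells in both cases, so the three-dimensional decomposition has the lower inherent ratio.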

Moreover, as we have shown experimentally in the previous section, the time
required for sending a message from one processor to another is almost independent of
both processor locations. Therefore, there is no sense in talking about physical neighbours,
and  the  mapping  of  the  logical  processors  over  the  physical  ones  is  not  very 
important,  as  far  as  communication  locality  is  concerned. 

Therefore,  these  ideas  suggest  a  general  rule:  Higher-dimensional  decompositions 
tend  to  be  more  efficient  than  lower-dimensional  decompositions  [5][8]. 

However, as we discussed in the previous section, the communication cost is also a
function  of  the  spatial  data  locality.  Therefore,  a  trade-off  between  the  improvement 
of  the  message  data  locality  and  the  efficient  exploitation  of  the  interconnection 
network  exists. 

The following figures compare the different decompositions for our sample
application in the Cray T3E. In the larger problem using 8 processors, and with the
new 450 MHz processors, the best 1D decomposition achieves improvements of 6.5% and
14.5% over the best 2D and 3D decompositions respectively. These differences have
grown by 2% and 10% compared to the old 300 MHz configuration. In the 16-processor
simulation the differences are lower (only 2.2% and 7%) for the same problem size
because the local matrices are smaller too.

Fig. 8. Different decompositions for our sample program in the CRAY T3E using 16 (on
the left) and 8 processors (on the right). The problem size is the number of cells in
each dimension for the finest grid in the multigrid hierarchy.


Fig. 9. Different decompositions for our sample program in the SGI Origin 2000 using
16 (on the left) and 8 processors (on the right). The problem size is the number of
cells in each dimension for the finest grid in the multigrid hierarchy.


In the SGI Origin 2000, we have obtained lower differences. Using 8 processors,
the best choice is also a linear decomposition, but it is only 5% and 7% better than the
2D and 3D decompositions. However, for the 16-processor simulation, the 2D
decomposition is 15% and 1% better than the 1D and 3D decompositions. The large
second-level cache of this system is again the reason for these results. The Cray T3E
is more sensitive to spatial data locality than the SGI because its performance
depends significantly on the effective use of the stream buffers system.

Therefore,  in  both  multiprocessors,  it  is  more  important  to  make  effective  use  of 
the  local  memory  hierarchy  than  to  reduce  the  overheads  due  to  network  delay  cost. 
So,  the  best  performance  is  usually  obtained  by  means  of  a  simple  linear 
decomposition. 

We  should  also  note  that,  although  we  have  considered  execution  time  as  the 
performance  metric,  there  are  many  aspects  to  the  evaluation  of  a  parallel  program.  A 
lower-dimensional  partitioning  program  is  easier  to  code,  so  if  we  consider 
implementation  cost,  a  one-dimensional  partitioning  is  also  the  best  choice.  Besides,  it 
allows  the  implementation  of  fast  sequential  algorithms  in  the  non-partitioned 
directions  [17]. 

In  a  workstation  cluster  a  linear  data  distribution  is  also  the  best  because  the  fewer 
the  number  of  neighbours,  the  fewer  the  number  of  messages  to  be  sent.  Therefore,  a 
one-dimensional  decomposition  reduces  TCP/IP  overheads  as  well  [18].  So,  if  we 
consider  portability,  a  one-dimensional  partitioning  is  also  the  best  choice. 


5.  Computation  -  Communication  Overlapping. 

A  typical  approach  for  dealing  with  the  communication  cost  due  to  the  transit 
latency,  the  bandwidth-related  cost,  and  contention,  is  to  hide  it  by  overlapping  this 
part of the communication with other useful work. The results in the previous sections
have  been  obtained  without  overlapping,  but  these  types  of  algorithms  can  be 
structured  so  that  every  process  request  for  remote  data  is  interleaved  explicitly  with 
local  computation.  For  this  purpose,  it  is  necessary  to  deal  with  the  boundaries  before 
the  inner  domain.  In  this  way,  it  is  possible  to  initiate  an  immediate  send  operation 
before  the  point  where  it  naturally  appears  in  the  program  and  the  message  may  reach 
the  receiver  before  it  is  actually  needed.  Thus,  the  receive  operation  does  not  stall 
waiting  for  the  message  to  arrive;  it  will  copy  the  data  straight  away  from  an 
incoming  buffer  into  the  application  address  space.  Therefore,  instead  of  using  the 
simple  pattern: 

1- Exchange artificial boundaries:

   Send boundaries to neighbours

   Receive artificial boundaries from neighbours

2- Update local domain using artificial boundaries

we must use:

1- Update boundaries

2- Send boundaries to neighbours

3- Update local domain using artificial boundaries

4- Receive artificial boundaries from neighbours

In order to evaluate the benefits and limitations of this new approach, we will
assume  that  message  initiation  and  reception  costs  are  the  same  in  the  two  structures, 
so  the  execution  time  can  be  estimated  as: 

Twithout_overlapping  =  Tlocal  +  Tcom_overhead  +  Tcom  .  (1) 

Toverlapping  =  Tboundaries  +  Tcom_overhead  +  max(Tinner,Tcom)  .  (2) 

Tlocal is the time spent in the local domain update, Tinner is the cost of the inner
domain update, Tboundaries is the time required for updating the boundaries,
Tcom_overhead is the sum of the send and receive overheads (it is important to recall
that these overheads, incurred on the processors, cannot be hidden) and Tcom is the
network delay. For a real problem, Tcom is lower than Tinner. Therefore, the
overlapping pattern is better than the simple approach while:

Tboundaries  +  Tinner  <  Tlocal  +  Tcom  .  (3) 

Tlocal can be divided into Tinner and Tboundaries_2, so the last inequality can be
simplified to:

Tboundaries  -  Tboundaries_2  <  Tcom  .  (4) 

This latter boundary update time, Tboundaries_2, is different from the previous one,
Tboundaries. Usually, the cost of updating the boundaries in the non-overlapping
approach (they are updated together with the inner local domain) is lower than in the
overlapping pattern due to the better exploitation of the memory hierarchy.
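
The cost model of equations (1) to (4) can be written as a short C sketch. The function names are hypothetical and any times passed to them are illustrative inputs, not measurements from either machine.

```c
/* Equation (1): exchange first, then update the whole local domain. */
double t_simple(double t_local, double t_overhead, double t_com)
{
    return t_local + t_overhead + t_com;
}

/* Equation (2): boundaries first, then the inner domain is updated
 * while the messages are in transit. */
double t_overlap(double t_boundaries, double t_overhead,
                 double t_inner, double t_com)
{
    double hidden = t_inner > t_com ? t_inner : t_com;
    return t_boundaries + t_overhead + hidden;
}

/* Inequality (4), valid when t_com < t_inner: overlapping wins only
 * if the extra boundary cost is smaller than the network delay. */
int overlap_pays_off(double t_boundaries, double t_boundaries_2,
                     double t_com)
{
    return (t_boundaries - t_boundaries_2) < t_com;
}
```

As a made-up example, take Tinner = 90, Tboundaries_2 = 10 (so Tlocal = 100), Tboundaries = 14 and Tcom_overhead = 5. With Tcom = 3 the simple pattern costs 108 and the overlapping pattern 109, so overlapping loses; with Tcom = 20 the costs are 125 and 109, so overlapping wins.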

The overlapping approach has been successfully used in old parallel computers like
the Parsys Supernode SN 1000, where the network bandwidth-related cost is very
important. In workstation clusters, the benefits are even greater because the network
is usually a non-private resource [18]. However, as we have discussed in the previous
sections, in the current generation of parallel computers Tcom is not so important.
Therefore, the increase due to the boundary update may be greater than the
reduction obtained by way of the overlapping.

We have verified these ideas with our test program. Figure 10 illustrates both
patterns using a linear decomposition. In the CRAY T3E the performance of the
non-overlapping approach is 7.3% higher than that of the overlap pattern for the
16-processor simulation (for the larger problem size with the 450 MHz processor) and
5% higher using 8 processors. These differences have grown compared to the old
configuration, where they are 6.4% and 4% respectively. Using 2D and 3D
decompositions we have obtained similar differences [16].

In the SGI, the differences are similar. In the 32-processor simulation, using a
linear decomposition, the difference for the larger problem is 7.5% [16].

Fig. 10. Overlapping versus non-overlapping approach on the Cray T3E using 8 (on the
left) and 16 processors (on the right). The problem size is the number of cells in each
dimension for the finest grid in the multigrid hierarchy. (Horizontal axis: problem
size; bars: simple and overlap patterns at 300 MHz and 450 MHz.)


6.  Conclusions 

We have shown how the optimal data partitioning of regular domains is a trade-off
between the improvement of the message data locality and the
computation/communication ratio. In older parallel computers the performance
depends mainly on the efficient exploitation of the interconnection network.
However, the performance obtained on current parallel computers, based on the
replication of commodity microprocessors, presents a growing dependence on the
efficient use of the memory hierarchy.

The main conclusions of the paper can be summarized in the following points, which
contradict to a certain extent the traditional wisdom on data partitioning: (1) the
partitioning of the domain must avoid boundaries with poor data locality due to the
reduction in the effective bandwidth, (2) 1D partitioning is becoming more efficient
than higher-dimension partitioning (moreover, it is easier to code, more suitable
for including fast sequential algorithms in non-partitioned directions, and more
portable), and (3) communication/computation overlapping does not reduce execution
time. These conclusions have been verified by experimental results on two
microprocessor-based computers: the Cray T3E and the SGI Origin 2000.
Abstract. In this paper we present the results of benchmark experiments
carried out on a Silicon Graphics Origin2000. We used the three
modules of the EuroBen Benchmark ([1]) to assess the performance of a
single node, as a shared memory system, and as a distributed memory
system. Where the situation calls for it, we compare the results with
those obtained on a Cray T3E and an IBM SP2. The results obtained
from this benchmark give a good impression of what performance can
be attained on the Origin2000 under what circumstances and expose the
weak and strong points of the system.


Keywords:  Performance  analysis,  High-performance  computers,  Programming 
models. 


1  Introduction 

The Silicon Graphics Origin2000 was introduced in the last quarter of
1996. Since then a considerable number of these systems have been installed,
ranging from 4-128 processors per system. The Origin2000 machine has a rather
complicated architecture and, like most high-performance computers, shows a
wide range of performance levels depending on memory access patterns, loop
content, fitness for and grain size of parallelism, etc. It was our intention to make
a performance profile of the Origin2000 which allows a fair estimate
of the performance under a variety of realistic operating circumstances. At the
same time, architectural bottlenecks can be identified. This may be valuable for
future system development and will in the end be of benefit for end users.

To assess the performance of the Origin2000 we used the EuroBen Benchmark,
version 3.2 ([1]). The benchmark was initially designed for testing
shared-memory MIMD systems. However, for a limited number of important cases
message-passing codes have also been developed.
Fig. 1. Configurations of Origin2000 systems with 16 and 32 processors.


This paper has the following structure: first the Origin2000 and the EuroBen
Benchmark are briefly described, next we present the most relevant results of
our benchmark study and we conclude with a summary and issues that might
be addressed in further research.


2 The Origin2000 system

The Origin2000 is a cache coherent, logically shared, physically distributed
memory system with 4-128 MIPS R10000 RISC processors. The features of the
processors are extensively described in [2,3]. These include out-of-order execution
of instructions and prefetching of operands in order to hide data-access latency.

The system we benchmarked contained 195 MHz processors with
a theoretical peak performance of 390 Mflop/s. The processors have 32 KB,
two-way set-associative primary instruction and data caches and a combined
secondary instruction and data cache of 4 MB. In parallel processing the caches
of the processors involved are kept coherent via a directory memory, see [2]. The
memory of the total system was in our case 16 GB.

Two processors are mounted on a node card together with a local part of the
memory and a HUB chip, an ASIC which connects all components on the node
card with each other. In addition, the HUB chip also connects the node card to
the other node cards and the I/O facilities of the system. The raw bandwidth
of the connections on the node card and between node cards is 780 MB/s, see
[4]. However, the processors have to share this bandwidth when accessing
data from memory. For the actual point-to-point bandwidth between processors
on the user level Silicon Graphics quotes a bandwidth of 150 MB/s. This is due
to various overheads and the cache coherency that is enforced by the system.

Node cards are, via their HUB chip, connected by routers to the rest of the
system. The interconnection of the routers has a hypercube topology. However,
for up to 32 processors so-called XpressLinks can be added to reduce the system
diameter to 3. Figure 1 shows some system configurations.
Silicon Graphics provides auto-parallelising compilers that attempt to spread
the content of loops evenly over the processors. In addition, the user may add
parallelisation directives in various styles. Next to SGI-proprietary directives,
ANSI X3H5 recommended ([5]) and OpenMP ([6]) directives are also accepted.
Distributed memory message passing libraries are available as well. Apart from the
SGI/Cray-style shmem library, MPI ([7]) and PVM ([8]) are supported. An HPF
compiler ([9]) for the Origin2000 is distributed by the Portland Group.

3  The  EuroBen  Benchmark 

To get a complete insight into the behaviour of the machine one has to investigate
the single-node performance, the shared-memory parallelisation capabilities, and
the possible (dis)advantages of using the system as a distributed memory system.
The EuroBen Benchmark has been built in a hierarchical way to extract the
necessary information and to build the performance profile from programs in
three modules of increasing complexity:

-  The  first  module  contains  programs  that  identify  the  machine  parameters 
that  govern  upper  and  lower  bounds  of  the  performance. 

-  The  second  module  contains  simple  but  basic  algorithms:  full  and  sparse 
linear  systems  solvers,  FFTs,  random  number  generation,  etc. 

-  The  third  module  places  the  algorithms  in  a  compact  application  setting 
and  applies  them  in  various  PDE  and  ODE  problem  implementations.  In 
addition,  linear  and  non-linear  least-squares  problems  and  some  I/O-bound 
problems  are  considered. 

For  a  full  description  of  the  benchmark  one  is  referred  to  [1]. 

3.1  Testing  circumstances 

The full benchmark applied on single nodes, together with the parallel execution
of relevant programs from the benchmark both with a shared-memory and
a distributed-memory message-passing programming model, gives sufficient
insight into the machine behaviour to enable reasonable performance estimates in
many circumstances. For the shared-memory programming model we used both
the SGI-proprietary and the ANSI X3H5 directives; for the message-passing
programs MPI was used. Moreover, features like Inter Procedural Analysis and
the quality of the numerical libraries provided by Silicon Graphics have been
assessed to complete the profile of the machine. Where relevant, to compare and
contrast the distributed memory results we have also done similar tests on two
other widely available DM-MIMD systems, a Cray T3E Classic and an IBM SP.
In addition some results from a Hitachi SR2201 were used.

We had the following testing circumstances for the systems quoted in this paper:

- Origin2000. The Fortran 77 MIPSPro compiler, version 7.20, compiler
options -O3 -64 -OPT:IEEE_arithmetic=3:roundoff=3, Operating System
IRIX 6.4 02121744. For the hardware specifications see section 2.

- IBM RS6000/SP. We used IBM RS6000/SP Thin nodes with 160 MHz
P2SC processors and 512 MB memory per node. The Fortran 90 compiler was
xlf, version 4.1, compiler options were -O3 -qarch=pwr2, Operating System
AIX, version 2.4 002006959400.

- Cray T3E Classic. We used 300 MHz DEC Alpha 21164 processors with
128 MB memory per node. The Fortran 90 compiler was CF90, version
3.0.1.3, compiler options were -O3 -dp, Operating System UNICOS/mk,
version 2.0.2.19.

- Hitachi SR2201. We used 200 MHz PA-RISC 720 processors with 256 MB
of memory per node. The Fortran 90 compiler was OFORT90, version
V02-05-/A, compiler option was -O3, Operating System HI-UX/MPP, version
SR220001 02-02 0.

In all cases we used the system clock with resolutions ranging from 0.5-15 µs.
We took care to use timing measurement intervals of at least a few hundred ms
to exclude measuring artefacts, repeating measurements where necessary.
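The timing procedure can be sketched as follows. This is a minimal illustration, not the benchmark's own timing routine: the function name `time_kernel` and the doubling strategy are assumptions, and Python's `time.perf_counter` stands in for the system clock.

```python
import time


def time_kernel(kernel, min_interval=0.3):
    """Return (seconds per call, repeat count), repeating the kernel until
    the measured interval is at least min_interval seconds."""
    repeats = 1
    while True:
        start = time.perf_counter()
        for _ in range(repeats):
            kernel()
        elapsed = time.perf_counter() - start
        if elapsed >= min_interval:
            return elapsed / repeats, repeats
        repeats *= 2  # interval too short relative to the clock: double the work


# Example kernel: a short loop that would be too fast to time in a single call
per_call, repeats = time_kernel(lambda: sum(range(1000)), min_interval=0.3)
print(per_call, repeats)
```

Keeping the measured interval several orders of magnitude above the clock resolution makes the relative error from the clock itself negligible.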

4  Benchmark  results 

From each of the three benchmark modules we present some representative
results, as the complete discussion of all results is far too extensive for this paper.
One is referred to the report [3] for a comprehensive presentation. The report is
downloadable from http://www.phys.uu.nl/~steen/euroben/reports/ as a
compressed PostScript file.

4.1  Module  1  results 

Program mod1ac measures the speed of a number of important basic operations
as a function of the array length. With the bandwidth to the CPU known we
should be able to assess whether the code generated by the compiler is optimal.
In Table 1 we list the single-node speeds for these operations with stride 1 access
to the operands as found for operation from the level 1 and level 2 cache.

Program mod1ac obtains the speeds of the operations with stride 1,
3, and 4 memory access. Moreover, the speeds of the same operations are
measured when accessing the operands via an index vector. Non-unit stride
access turns out to have quite little influence on the performance. Indirect indexed
operations incur a loss of roughly 30% in speed due to address operations. So, we
present only the stride-1 values. The first and fourth column show the maximum
observed performance, r_max, when accessed from the primary and secondary
cache, respectively. As the secondary cache is quite large (4 MB), a relatively
small proportion of data references will have to be to the main memory.

The dependency of the execution time on the array length can be modelled
with considerable precision by a linear model t(n) = a + bn where a is the
latency and b is the time per operation per element. These parameters are given
as the third and second column entries of Table 1. It enables us to draw
definite conclusions about the optimality of the generated code for the operations
considered.
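As an illustration of how the two parameters of the linear model can be obtained from timings, the following sketch performs an ordinary least-squares fit. The function name `fit_linear` and the synthetic data are assumptions for this example; the benchmark's own fitting procedure is not described here.

```python
def fit_linear(ns, ts):
    """Least-squares fit of t(n) = a + b*n; returns (a, b)."""
    m = len(ns)
    mean_n = sum(ns) / m
    mean_t = sum(ts) / m
    sxx = sum((n - mean_n) ** 2 for n in ns)
    sxy = sum((n - mean_n) * (t - mean_t) for n, t in zip(ns, ts))
    b = sxy / sxx              # time per element
    a = mean_t - b * mean_n    # latency
    return a, b


# Synthetic timings in cycles, generated from a = 20 and b = 3 (made-up values)
ns = [10, 100, 1000, 10000]
ts = [20 + 3 * n for n in ns]
a, b = fit_linear(ns, ts)
print(a, b)
```

With real measurements the fitted `a` estimates the operation latency and `b` the cost per element, which are the quantities tabulated per kernel.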
 #   Operation              L-1 cache   L-1 cache    L-1 cache   L-2 cache
                            r_max       Cycles per   Latency     r_max
                            Mflop/s     op/element   cycles      Mflop/s
 1   Broadcast              195.60                   23           61.92
 2   Copy                    95.29                   15           43.65
 3   Addition                64.66                   21           34.36
 4   Subtraction             64.48                   18           34.57
 5   Multiplication          64.45                   18           34.52
 6   Division                 9.23      21            0            9.22
 7   Dotproduct             194.46       2           14          137.61
 8   x = x + ay             128.92       3           19           69.13
 9   z = x + ay             128.62       3           17           66.99
10   y = x1 x2 + x3 x4      107.39       6           23           56.97
11   1st order recurs.       96.39       2           23           46.04
12   2nd order recurs.       96.69       4           22           80.31
13   2nd difference         242.31       2.5         36          132.54
14   9th Degr. Polynomial   376.92       9           31          351.17

Table 1. r_max, the number of cycles per operation per element, and the latency values
for the primary cache operations on a single processor of the Origin2000. Only results
of the first 14 kernels are shown. The operations all have unit stride access. The
operation latency from secondary cache is completely hidden by the data access.

The dyadic operations addition, subtraction, and multiplication operate at
1/6th of the Theoretical Peak Performance, 390 Mflop/s, when accessed from the
primary cache as the total operation takes 3 cycles. With an ideal bandwidth
situation, transferring two operands to the relevant functional unit and shipping
one result back per clock cycle, the performance should approximately be half the
Theoretical Peak Performance. One can conclude that only one 8-byte data item
can be transferred per cycle. This is in agreement with the bandwidth quoted by
the vendor. The dotproduct and the daxpy operation (kernels 7 and 8) also show
speeds that closely agree with this bandwidth with computational intensities of
1 and 2/3, respectively ([10]). It shows that, at least for these simple operations,
the compiler is able to generate optimal code given the limited bandwidth of one
operand/cycle. With a high reuse of operands, like the evaluation of a 9th-degree
polynomial and a computational intensity of 9, a large fraction of the Theoretical
Peak Performance can be obtained; kernel 14 shows a performance of 96% of the
Theoretical Peak Performance.
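The arithmetic behind these fractions can be checked with a few lines. The sketch below only restates numbers given in the text (195 MHz clock, 390 Mflop/s peak, 3 cycles per dyadic operation, 376.92 Mflop/s for kernel 14); the helper name `mflops` is an assumption for this example.

```python
CLOCK_MHZ = 195.0     # processor clock (section 2)
PEAK_MFLOPS = 390.0   # theoretical peak, i.e. 2 flops per cycle


def mflops(flops_per_element, cycles_per_element):
    """Speed in Mflop/s of a kernel needing the given number of cycles per element."""
    return CLOCK_MHZ * flops_per_element / cycles_per_element


dyadic = mflops(1, 3)                   # one flop every 3 cycles
fraction_dyadic = dyadic / PEAK_MFLOPS  # expected to be 1/6
fraction_poly = 376.92 / PEAK_MFLOPS    # kernel 14, measured value from Table 1
print(dyadic, fraction_dyadic, fraction_poly)
```

The dyadic estimate of 65 Mflop/s lies close to the measured 64.5-64.7 Mflop/s for kernels 3 to 5.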

Shared-memory parallel performance of program mod1ac. Ideally, the
simple, vector-oriented operations in program mod1ac should speed up almost
linearly with the number of processors when executed in parallel. There are
two effects that will decrease the potential speedup: the parallelisation overhead
inherent in the distribution of the data and the synchronisation of the multiple
processes and, secondly, the slowdown per processor when the array length per
processor decreases because of the latency of the operation. In Figure 2 the
speeds on 1, 8, and 32 processors are displayed for the first 14 kernels of program
mod1ac.

Fig. 2. Speeds in Mflop/s of the first 14 kernels of program mod1ac on 1, 8, and 32
processors.

The Fortran compiler uses heuristics to determine whether the computational
content of a loop is sufficient to warrant parallel processing. If not, the loop is
executed sequentially. When recurrences are detected, the loop is also executed
sequentially. This is the case with kernels 11 and 12 representing first and second
order recurrences, respectively. All other kernels but one are executed in parallel.
For all these kernels there turns out to be at least some benefit in parallel
execution. The exception is the dotproduct, which shows a lower performance on 8
processors in parallel and is executed sequentially on 32 processors. It shows that
the heuristics used to determine a sufficient amount of parallelism are basically
correct in that the parallel execution is not slower than the sequential one.

In many cases, however, the speedup is not very high. The inherently slow
division (kernel 6) and kernel 14, the evaluation of a 9th-degree polynomial,
which both have a large computational content, benefit the most, while a kernel
like the daxpy operation (kernel 8) shows a speedup of only 12% from 8 to 32
processors. Here also the latency of the operation plays a role: the array length on
32  processors  is  only  31  elements.  With  this  array  length  the  speed  per  processor 
is  already  15%  lower  than 

In summary one can conclude that the computational content of a loop should
preferably not be below 10 flops to attain a sizable speedup at 32 processors.
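The effect of the operation latency on short arrays can be estimated from the linear model t(n) = a + bn. The sketch below uses the Table 1 entries for kernel 8 (3 cycles per element, 19 cycles latency) and the array length of 31 elements quoted above. Treating these single-processor primary-cache values as applicable to the parallel run is an assumption, and the function name `relative_speed` is hypothetical.

```python
def relative_speed(n, latency_cycles, cycles_per_element):
    """Speed at array length n relative to the asymptotic speed, for t(n) = a + b*n."""
    work = cycles_per_element * n
    return work / (latency_cycles + work)


short = relative_speed(31, latency_cycles=19, cycles_per_element=3)
long_ = relative_speed(1000, latency_cycles=19, cycles_per_element=3)
print(short, long_)
```

Under these assumptions a processor working on 31 elements runs at roughly 83% of its asymptotic speed, which is of the same order as the loss mentioned in the text.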

Distributed-memory parallel dotproduct. From Figure 2 it was clear that
the shared-memory programming model is not suited for parallel execution of the
dotproduct. We also executed the dotproduct with a distributed-memory
programming model using MPI. Three implementations were considered: a “naive”
implementation, in which all partial sums are sent to a root processor which also
distributes the global sum back directly to all other processors, a
Fortran-implemented tree algorithm for gathering the partial sums and
broadcasting the global sum, and an implementation based on MPI_Reduce and
MPI_Bcast. The last implementation contains MPI functions that should
be optimised by the vendor and perform at least as well as the
Fortran-implemented tree algorithm. Figure 3 shows the result of this
distributed-memory dotproduct.

Fig. 3. Performance in Mflop/s of the three distributed-memory dotproduct
implementations on 1-32 processors.

The first observation that can be made is that the Fortran-based tree
implementation and the MPI_Reduce/MPI_Bcast implementation indeed are quite
close in performance. So, MPI_Reduce and MPI_Bcast are optimised
communication functions. Both perform considerably better than the naive
implementation, especially for a larger number of processors. The second
observation is that the distributed-memory version of the dotproduct scales well
with the number of processors: at 32 processors a speed of 3167 Mflop/s is
attained, about 100 Mflop/s per processor, including the time lost in
communication. So, the distributed-memory version is preferable by far over the
shared-memory version from a performance point of view.
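The difference between the naive and the tree-based gathering of partial sums can be illustrated with a small simulation. This is a sketch under simplifying assumptions: each message costs one step, messages handled by the root in the naive scheme are strictly sequential, and the partial sums are made-up values. It does not reproduce the benchmark code.

```python
def naive_steps(p):
    """Root receives p-1 partial sums one after another, then sends p-1 results."""
    return 2 * (p - 1)


def tree_sum(partials):
    """Pairwise (binary-tree) summation; returns (total, number of parallel steps)."""
    values = list(partials)
    steps = 0
    while len(values) > 1:
        # In each step, neighbouring pairs are combined simultaneously
        paired = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            paired.append(values[-1])  # the odd one out moves up unchanged
        values = paired
        steps += 1
    return values[0], steps


partials = [float(rank + 1) for rank in range(32)]  # made-up partial sums
total, gather_steps = tree_sum(partials)
tree_steps = 2 * gather_steps                       # gather plus tree broadcast
print(total, tree_steps, naive_steps(32))
```

For 32 processors the tree needs 10 message steps against 62 for the naive scheme, which is consistent with the growing gap at larger processor counts.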

Point-to-point communication. The program mod1h measures bandwidth
and latency between two processors using the MPI library functions MPI_Send
and MPI_Recv with message lengths varying from 40-10,000,000 bytes. This
covers the full range of possibilities: communication from the primary cache, from
the secondary cache, and from the main memory. The interprocessor
communication speed with point-to-point communication is not negligible in
comparison with the speed between the local memory and the CPUs. Therefore, it is
useful to consider this full range as it may affect the communication patterns one
wants to use.

Fig. 4. Graph of bandwidths in point-to-point message passing using MPI_Send and
MPI_Recv. Results for the Origin2000, the SGI/Cray T3E-Classic, and the IBM SP
are shown. On the T3E the stream buffers were on.

The same program has also been run on a Cray T3E Classic, an IBM SP2 and
a Hitachi SR2201. As the cache sizes of these systems are different, one might
expect to see different behaviour for these systems as indeed is the case. This is,
however, not only due to the different access speed in the memory hierarchies.
In MPI the strategy in MPI_Send of buffering messages, or not, is left to the
implementor. As it may be assumed that different implementation decisions
have been made for different machines, observed differences in bandwidth may
originate from differences in local access times, another message buffer strategy,
or both. Therefore, the best decision seems to be to give the bandwidth as a
function of the message length and the latency as derived from very short
messages (e.g., up to 400 bytes). For these short messages one may assume that no
auxiliary buffering is required and one may obtain a fair idea of the latency as
experienced through the software. In addition, this information is important
because of the frequency with which messages of only one data item are exchanged;
it enables an estimate of the slow-down caused by such messages. The bandwidth
versus the message length is shown in Figure 4.

Note  that  the  bandwidth  of  the  Origin2000  is  decreasing  from  about  115 
MB/s  for  sufficiently  long  messages  up  to  2  MB  to  102  MB/s  at  4  MB.  As  already 
mentioned  in  section  2,  the  bandwidth  available  at  the  application  level  is  150 
MB/s,  so  the  bandwidth  found  reasonably  matches  this  figure.  For  messages 
longer  than  4  MB  the  bandwidth  even  drops  to  about  78  MB/s.  We  do  not 
observe  this  behaviour  on  the  other  three  systems.  We  ascribe  the  decreasing 
bandwidth  on  the  Origin  to  the  fact  that  buffer  copies  above  4  MB  do  not  fit  in 
the secondary cache anymore and therefore the memory must be accessed. The
less than ideal MPI implementation might be at the base of this effect. In Table
2 we summarise the maximal bandwidths and latencies for the four systems.

System                   Bandwidth   Latency
                         Mbyte/s     µs
SGI Origin2000           115.75      14.6
SGI/Cray T3E-Classic                 22.3
IBM SP                   104.85      34.7
Hitachi SR2201           216.69      29.7

Table 2. Maximum bandwidths and latencies for the Origin2000, the SGI/Cray
T3E-Classic, the IBM SP, and the Hitachi SR2201.

Fig. 5. Performance for y = Ax. Only the fastest Fortran 77 and the SGI library
routine are shown.
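To see how latency and bandwidth together determine the cost of a message, a simple two-parameter model can be used. The sketch below assumes the time for a message is latency plus size divided by bandwidth, and plugs in the Origin2000 values from Table 2. The model ignores the cache and buffering effects discussed above, and the function names are assumptions for this example.

```python
def transfer_time_us(message_bytes, latency_us, bandwidth_mb_s):
    """Time in microseconds for one message: t = latency + size / bandwidth."""
    return latency_us + message_bytes / bandwidth_mb_s  # 1 MB/s equals 1 byte/us


def effective_bandwidth(message_bytes, latency_us, bandwidth_mb_s):
    """Achieved bandwidth in MB/s once the latency is included."""
    return message_bytes / transfer_time_us(message_bytes, latency_us, bandwidth_mb_s)


# Origin2000 values from Table 2
LATENCY_US, BANDWIDTH = 14.6, 115.75

one_item = effective_bandwidth(8, LATENCY_US, BANDWIDTH)       # single 8-byte item
large = effective_bandwidth(1_000_000, LATENCY_US, BANDWIDTH)  # 1 MB message
half_length = LATENCY_US * BANDWIDTH  # bytes at which half the peak bandwidth is reached
print(one_item, large, half_length)
```

In this model a message holding a single data item achieves well under 1 MB/s, which illustrates why frequent one-item messages cause a noticeable slow-down.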

4.2  Module  2  results 

Of module 2 we present two programs: program mod2a, which measures the
speed of a matrix-vector multiplication, and mod2e, which solves a large sparse
eigenvalue problem. For the discussion of all programs of module 2 one
is referred to [3].

mod2a, single-node. In the single-node version problem sizes of n = 25, 50,
100, 200, 300, and 500 are considered for each of five implementations. For the
sake of clarity, we show only the fastest of the Fortran 77 implementations
together with the result of the library version of the BLAS 2 routine dgemv
in Figure 5. The implementations actually used are a dotproduct, or row-wise
Columns (all speeds in Mflop/s): Order; Not unrolled, row-wise; 4 x unrolled,
row-wise; Not unrolled, column-wise; 4 x unrolled, column-wise; Library version.

Legible entries per matrix order, in column order:

(order not legible)   135.8   78.2   181.3   173.3   102.3
100                   167.7   123.7   225.6
200                   184.2   138.4   242.3   234.5
300                   186.9   138.1   227.6
500                   187.1   77.3    201.4   189.5

Table 3. Performances on the Origin2000 for y = Ax. Four different Fortran 77
implementations and the SGI library version are shown.

implementation, a daxpy or column-wise implementation, and the four times
unrolled versions of these two methods. On many systems the unrolled versions
perform better than their not unrolled equivalents. This is, however, not the case
on the Origin. The reason is that the Fortran 77 compiler itself already unrolls
loops where possible and this is certainly so for the simple inner loops used in
the various not unrolled implementations. For the implementations where hand
unrolling is done the compiler is not able to generate code of comparable quality
and the performance of the unrolled versions lags behind, as shown in Table 3.
So, a fairly obvious hand optimisation does not work out very well here. The
lesson could be not to do this kind of optimisation on the Origin, to give the
compiler a better chance for automatic optimisation. One of the objectives of
program mod2a is to make users aware of such facts.

Note that in the column-wise version, using daxpy operations, a speed is
attained that is twice as high as found with program mod1ac for kernel 8 (see
Table 1). Within the context of a matrix-vector multiplication with the daxpy
as an inner loop, the compiler is able to overlap two successive iterations of the
inner loop, thus winning a factor of 2 in speed.

mod2a, parallel versions. Of mod2a also a shared-memory and a
distributed-memory version were executed to assess the potential benefit of the
parallelisation in both programming models. In Figure 6 the results for the two
implementations are shown. It is clear from the figure that the distributed-memory
version is much faster than its shared-memory counterpart: 7.3 vs. 2.7 Gflop/s
on 32 processors. In the distributed-memory implementation the data
distribution is such that no data have to be communicated between the processors.
In this situation the distributed-memory version is preferable. However, when the
transposed matrix-vector product is performed, all-to-all communication is
required. The overhead in sending messages turns out to be so high in this case
that the shared-memory version is now faster than the distributed-memory
version: 2.5 vs. 0.15 Gflop/s on 32 processors.
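Why one product needs no communication while the transposed one does can be shown with a small simulation of processes as list slices. The sketch assumes a row-block distribution of A with a full copy of x on every process; the text does not state the actual distribution, so this is only one arrangement with the described property. All function names are hypothetical.

```python
def split_rows(matrix, nproc):
    """Assign contiguous row blocks of the matrix to nproc simulated processes."""
    n = len(matrix)
    bounds = [n * p // nproc for p in range(nproc + 1)]
    return [matrix[bounds[p]:bounds[p + 1]] for p in range(nproc)]


def local_matvec(rows, x):
    """y_local = A_local x: needs only the local rows and a full copy of x."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in rows]


def local_transposed(rows, x_local, n):
    """Local contribution to y = A^T x: a full-length vector of partial sums."""
    y = [0.0] * n
    for row, xi in zip(rows, x_local):
        for j, a in enumerate(row):
            y[j] += a * xi
    return y


A = [[1.0, 2.0, 0.0, 0.0],
     [0.0, 1.0, 2.0, 0.0],
     [0.0, 0.0, 1.0, 2.0],
     [2.0, 0.0, 0.0, 1.0]]
x = [1.0, 2.0, 3.0, 4.0]
blocks = split_rows(A, 2)

# y = A x: concatenating the local results gives y, nothing is exchanged
y = [v for rows in blocks for v in local_matvec(rows, x)]

# y = A^T x: every process contributes to every element of y, so the
# partial vectors have to be summed across all processes
partials = [local_transposed(rows, x[2 * p:2 * p + 2], 4)
            for p, rows in enumerate(blocks)]
yt = [sum(col) for col in zip(*partials)]
print(y, yt)
```

The final summation over `partials` is the step that turns into all-to-all message traffic on a distributed-memory machine.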

Program mod2e. In program mod2e the 10 smallest eigenvalues of penta-diagonal,
symmetric systems with matrix orders n = 100, ..., 10000 are computed by a
generalised Lanczos iteration scheme. In Figure 7 we show the speed per iteration
for the range of system orders both without and with interprocedural analysis.

Fig. 6. Performance surface of a parallel shared-memory implementation (left) and a
distributed memory implementation (right) of a matrix-vector product.

Fig. 7. Performance per iteration for sparse eigenvalue computation without and with
interprocedural analysis.

Figure 7 shows that the interprocedural analysis results in a small but
consistently better performance over the whole problem range. The difference becomes
slightly larger for the largest problem size because in this case the floating-point
operations in the generalised Lanczos routine more strongly dominate the
computation.

The floating-point operations on the diagonals of the matrices are typical
vector operations as measured in program mod1ac and therefore the kernels
from mod1ac should predict the speed of the Lanczos routine to a reasonable
extent. The mix of floating-point operations as measured in mod1ac was as
follows:
    Dotproduct     34.3%
    Kernel 10      25.7%
    axpy           22.9%
    Dyadic mult.   17.1%

The weighted average of the peak speeds of these operations in the primary cache
is 134.5 Mflop/s. From Figure 7 we see that with interprocedural analysis the
speed for the largest problem is 133 Mflop/s and without interprocedural
analysis we find 127 Mflop/s. This is in excellent agreement with the speeds found
for the kernels of mod1ac. This consistency shows that in the right context the
prediction of the performance from kernel speeds might help to understand the
observed performance. The right context is important though, as was
demonstrated with program mod2a.
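The weighted average can be recomputed from the operation mix and the primary-cache speeds in Table 1. The sketch assumes an arithmetic weighted mean, takes "axpy" to be kernel 8 and "dyadic mult." to be kernel 5; these mappings are assumptions, as the text does not spell them out.

```python
# (primary-cache speed in Mflop/s from Table 1, weight from the operation mix)
mix = {
    "dotproduct (kernel 7)": (194.46, 0.343),
    "kernel 10": (107.39, 0.257),
    "axpy (kernel 8)": (128.92, 0.229),
    "dyadic multiplication (kernel 5)": (64.45, 0.171),
}

weighted_average = sum(speed * weight for speed, weight in mix.values())
total_weight = sum(weight for _, weight in mix.values())
print(round(weighted_average, 2), round(total_weight, 3))
```

Under these assumptions the result is about 134.8 Mflop/s, close to the 134.5 Mflop/s quoted; the small difference may stem from rounding of the percentages or of the speeds used.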

In the present form program mod2e is badly suited for parallelisation.
Therefore no parallel results are presented.

4.3  Module  3  results 

In  module  3  various  programs  are  considered  that  represent  important  classes  of 
applications.  The  programs  have  been  tailored  in  the  sense  that  only  the  essential 
floating-point  parts  have  been  retained  as  this  is  our  main  concern.  However,  the 
first  two  programs  in  this  module  are  designed  to  test  important  I/O  patterns 
to  obtain  an  idea  of  the  I/O  capabilities  of  the  systems  considered.  Again,  we  do 
not  discuss  the  full  range  of  programs  in  this  module.  See  [3]  for  the  complete 
results. 

Most of the programs in this module have a complexity that makes it
difficult to estimate their Mflop-rate. So, mainly execution times are reported. In
addition, only one of the programs was amenable to parallelisation (program
mod3h). On the other hand, many module 3 programs have a complexity that
made it worthwhile to subject them to interprocedural analysis.

To  place  the  results  in  context,  we  added  timings  of  two  other  systems:  the 
T3E-Classic  and  the  IBM  RS/6000  SP. 

PDE programs. In module 3 three implementations of Elliptic/Parabolic PDE
solvers are included: programs mod3c, a Multigrid solver, mod3g, a Fast Elliptic
solver, and mod3h, a Block Relaxation solver. They all solve the
same model problem, a Laplace equation on the unit square. They differ vastly
in their solution for this particular problem but each method has its own
virtues that make them more or less complementary. The execution times are
given in Table 4. As can be seen from the table, a single node of the T3E
is consistently slower than those of the IBM SP and the Origin2000. Note that
only in program mod3c the IBM SP is significantly faster than the Origin2000,
although its theoretical peak performance is much higher: 640 vs. 390 Mflop/s.
Furthermore, it turns out that interprocedural analysis gives a very slight
advantage over the normal analysis. In general, for the programs of this module
the effects of interprocedural analysis were not large.

System                                  mod3c      mod3g      mod3h
                                        (seconds)  (seconds)  (seconds)
Cray T3E-Classic                         2.424      0.114     10.083
IBM RS/6000 SP                           0.970      0.083      3.670
SGI Origin2000                           1.486      0.065      2.366
SGI Origin2000, interproc. analysis      -          0.062      -

Table 4. Execution times for three PDE solvers on the Cray T3E-Classic, the IBM
RS/6000 SP, and the Origin2000.


ODE program  In program mod3f the problem of gas diffusion into a porous
medium is considered. In this program two gases with different diffusion
coefficients are modeled. The implementation is such that a time sequence of
stiff two-point boundary-value ODEs is solved. The timing results for the
program are displayed in Table 5.

System                                  Execution time (seconds)
Cray T3E-600                            16.003
IBM RS/6000 SP                           8.5646
SGI Origin2000                           8.7060
SGI Origin2000, interproc. analysis      7.6141

Table 5. Performances in seconds in program mod3f for various systems (single-node
performance).


Table 5 shows the same general pattern as was found for the PDEs: the T3E is
notably slower than the other two machines, while the IBM SP is only marginally
faster than the Origin2000 with standard code analysis, notwithstanding its
higher Theoretical Peak Performance. With interprocedural analysis, the
Origin2000 is about 15% faster than with standard code analysis.
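
The percentages quoted for program mod3f can be checked directly from the
Table 5 timings. The short Python sketch below is only an illustration: the
helper name percent_faster is ours and not part of the benchmark, and it
assumes that "X% faster" is meant as a ratio of execution times.

```python
# Single-node execution times for program mod3f, in seconds (from Table 5).
times = {
    "Cray T3E-600": 16.003,
    "IBM RS/6000 SP": 8.5646,
    "SGI Origin2000": 8.7060,
    "SGI Origin2000, interproc. analysis": 7.6141,
}


def percent_faster(t_slow, t_fast):
    """How much faster (in %) the run taking t_fast is than the one taking t_slow."""
    return (t_slow / t_fast - 1.0) * 100.0


# Gain from interprocedural analysis on the Origin2000.
ipa_gain = percent_faster(times["SGI Origin2000"],
                          times["SGI Origin2000, interproc. analysis"])
# Advantage of the IBM SP over the Origin2000 with standard analysis.
sp_gain = percent_faster(times["SGI Origin2000"], times["IBM RS/6000 SP"])

print(f"Origin2000 with interprocedural analysis: {ipa_gain:.1f}% faster")
print(f"IBM SP vs. Origin2000 (standard analysis): {sp_gain:.1f}% faster")
```

Under this definition the interprocedural gain comes out a little over 14%,
in line with the "about 15%" quoted in the text, and the advantage of the
IBM SP is below 2%.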

5  Summary  and  future  work 

The amount of information from our experiments has been vast and, although we
have discussed it to a fair extent, we are sure that a more extensive analysis
would still bring up new points in the interpretation. It would almost certainly
also give grounds for new experiments. In this study we have also refrained
from hand-optimisation: we just let the compiler do the work with the
appropriate compiler options. Other subjects not considered but probably
important are the explicit placement of data on the Origin2000 system and the
migration of data by the operating system to the processor that uses them most.
On the other hand, a number of useful conclusions can be drawn from this study,
of which we list the main ones below:

- In many cases a large proportion of the Theoretical Peak Performance can be
attained when operating from the primary cache. The performance with access
from the secondary cache is generally 2-3 times slower, except for the division
operation.

- The experiments in program mod1ac showed that one 8-byte operand per cycle
can be loaded or stored from/to the primary cache. From the secondary cache
this is about one operand per two cycles.

-  When  automatic  parallelisation  is  applied,  the  default  choices  whether  or  not 
to  parallelise  a  certain  loop  seem  to  be  adequate  in  most  cases  we  observed. 

- The point-to-point bandwidth measured with MPI is about 110 MB/s, about
70% of the bandwidth of 150 MB/s quoted by SGI.

- The automatic shared-memory parallelisation of codes generates a
non-negligible parallelisation overhead, as shown by program mod2a, a
matrix-vector multiplication. Compared with the distributed-memory version it
gives a large performance loss. On the other hand, as soon as messages must
also be exchanged, the shared-memory implementation is clearly faster than the
MPI version. A similar phenomenon was observed in the FFT program mod2f.
Communication timings suggest that the MPI implementation we used in the
present tests is not optimal.

- In the rather small programs of module 3 interprocedural analysis generally
had a quite modest influence on the execution time (5-15% decrease).


Abstract. JET is a parallel library implemented in Java for parallel
computing over the Internet. The JET library is oriented to long-running
Master/Worker applications with a coarse-grain task distribution. The
computation is performed by Java applets that are downloaded through a Web
page. The paper describes some internals of JET and its mechanisms to provide
support for fault-tolerance, interoperability with PVM/MPI and the use of
statistics. The paper includes some performance figures that were taken with
simple benchmarks and more complex applications.


1.  Introduction 

In recent years we have seen an extraordinary increase in the number of
machines connected to the Internet, and this growth is expected to continue
exponentially. According to a survey carried out by Network Wizards
[NetWizards] in January 1998, 29.6 million hosts were connected to the Internet
(against 16 million in January 1997). This mass of interconnected processors
represents a very significant processing power, with a performance level of a
Petaflop (10^15).

For a large percentage of their time, workstations and personal computers
are used only for small interactive tasks, such as reading mail or editing
files. As was remarked in [Schrage92], workstations remain idle about 90% of
the time.

The  idea  of  using  this  spare  computational  power  in  computers  that  are  connected 
to  the  Internet  seems  to  be  quite  promising  and  is  getting  an  enthusiastic  acceptance 
within  the  high-performance  computing  community.  Two  main  things  are  required: 

•  appropriate applications that take a long time to execute and have low
   communication requirements;

•  an effective infrastructure to support the execution of massively parallel
   applications on hundreds or thousands of computers geographically dispersed
   through the Internet.

The main challenge of JET is to provide such an infrastructure. It was
implemented in Java [JavaSoft] to provide portability of code, to solve the
problem of heterogeneity of systems and to allow easy distribution of code to
the machines that want to volunteer their spare CPU cycles for solving a
massively parallel application.

Applications that are good candidates for the JET parallel machine should
divide the problem into small tasks to be executed by different processors
distributed over the Internet. Those applications should be coarse-grained,
take a long time to execute, not require ultimate performance and tolerate, to
some extent, the high latency of the network. There are some quite important
applications from the fields of cryptography and mathematics that can be
effectively executed with JET.


2.  JET  Architecture 

The applications that can be executed with JET follow the Master/Worker
paradigm. There is a process, the Master, which is responsible for the
decomposition of the problem into small and independent tasks. The tasks are
distributed among the worker processes, which execute a simple cycle: receive a
task, compute it and send back the result. The results are gathered by the
Master process, which merges them to construct the final solution. Since the
tasks are independent of each other, there is no need for communication between
the worker processes.
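
The Master/Worker cycle can be illustrated with a minimal Python sketch. It is
not JET code: a local thread pool stands in for the remote worker applets, the
"problem" is a toy sum of squares, and the names split_into_tasks, worker and
master are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_tasks(lo, hi, n_tasks):
    """Master side: decompose the range [lo, hi) into independent sub-ranges."""
    step = max(1, (hi - lo + n_tasks - 1) // n_tasks)
    return [(a, min(a + step, hi)) for a in range(lo, hi, step)]


def worker(task):
    """Worker side: receive one task, compute it in isolation, return the result."""
    a, b = task
    return sum(i * i for i in range(a, b))


def master(lo, hi, n_tasks=8, n_workers=4):
    """Distribute the tasks, gather the partial results and merge them."""
    tasks = split_into_tasks(lo, hi, n_tasks)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(worker, tasks))
    return sum(partial_results)  # merge step: here a plain sum


total = master(0, 1000)
```

Because no task depends on another, the workers never exchange data with each
other; only the Master sees all partial results.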

JET is non-intrusive to the machines that access any Web page: only those users
who are willing to volunteer their CPU time will have an applet working on
their computer and contributing to a JET computation. Users who wish to
volunteer for a JET computation have to access a Web page using a Java-enabled
browser and follow a Web link. The downloaded Web page has an embedded Java
applet (the Worker applet), which indicates the status of the computation and
communicates with the JET Master.

The security features of Java only allow applets to communicate with the
machine from which they were downloaded. Hence, the Master process has to
execute on the same machine as the http daemon. It listens on a port that is
well known to all the Workers. The communication between Workers and Master is
done through UDP sockets. Although the UDP protocol does not guarantee the
delivery of messages, it provides higher scalability and consumes fewer
resources than TCP sockets. The communication layer of JET implements a
reliable service that ensures sequenced and error-free message delivery.

The JET library has a server checkpointing mechanism to assure the continuity
of the application when there is a failure or a preventive shutdown of the JET
Server. The critical state of the application is saved periodically in stable
storage in a portable format that allows its resumption later on the same or a
different machine.
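
A minimal sketch of the server checkpointing idea is given below in Python.
The text does not say which portable format JET uses or what the saved state
contains, so JSON and the fields "pending" and "results" are assumptions made
purely for illustration, as are the function names.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file, then atomically replace the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the data reaches stable storage
    os.replace(tmp_path, path)  # a crash never leaves a half-written checkpoint


def load_checkpoint(path, default):
    """Resume from the last checkpoint, or start from `default` if none exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

A server following this scheme would call save_checkpoint periodically and
load_checkpoint once at start-up; since the file is plain text, it can be
read back on a different machine or operating system.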

To tolerate the loss of the stateless worker applets the JET library maintains
a task-reconfiguration scheme. The library keeps track of the jobs that have
been sent to each worker applet. If one applet fails or withdraws from the
virtual machine, the only part of the computation that is affected is the task
it was executing. Re-allocating that task to another worker reproduces the lost
work without changing the ultimate outcome of the computation. However, for
applications with very long-running tasks it is important to save intermediate
states of the task execution in the worker applets.
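
The bookkeeping behind such a task-reconfiguration scheme can be sketched as
follows. This is an illustrative Python model, not the JET implementation; the
class TaskTable and its method names are hypothetical, and it assumes each
worker holds at most one task at a time.

```python
from collections import deque


class TaskTable:
    """Tracks which task each worker holds so that lost work can be re-allocated."""

    def __init__(self, tasks):
        self.pending = deque(tasks)  # tasks not yet handed out
        self.assigned = {}           # worker id -> task currently being executed
        self.results = {}            # task -> result

    def next_task(self, worker_id):
        """Hand the next pending task to a worker, or None if nothing is left."""
        if not self.pending:
            return None
        task = self.pending.popleft()
        self.assigned[worker_id] = task
        return task

    def complete(self, worker_id, result):
        """Record the result of the task the worker was executing."""
        task = self.assigned.pop(worker_id)
        self.results[task] = result

    def worker_lost(self, worker_id):
        """A worker failed or withdrew: put its task back at the front of the queue."""
        task = self.assigned.pop(worker_id, None)
        if task is not None:
            self.pending.appendleft(task)

    def done(self):
        return not self.pending and not self.assigned


table = TaskTable(["t0", "t1", "t2"])
table.next_task("w1")                # w1 gets t0
table.next_task("w2")                # w2 gets t1
table.worker_lost("w2")              # w2 disappears: t1 goes back to the queue
table.complete("w1", "r0")
reassigned = table.next_task("w1")   # w1 now picks up the lost task
```

Only the task held by the lost worker is repeated; results that were already
returned are untouched.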

Implementing client checkpointing is not trivial in a Java applet since it
cannot write to the local disk. Therefore, the only way we had to implement
client checkpointing was to send the checkpoint data over a socket stream to
the associated JET Master. When a Worker applet withdraws from the virtual
machine, the last checkpoint of its task is distributed to another worker.

The JET machine needs to motivate Web surfers to participate in the
computation, and even for interesting applications it is necessary to increase
their enthusiasm. The JET Server gathers information about the computation done
by each volunteer and creates a statistics module with several rankings. The
statistical information, organized in rankings by several categories (e.g.
users, countries, operating systems, processors and browsers), is published on
the Web. The users are also able to create teams. These rankings create a
healthy competition between users and keep up their interest in participating
in the computation.

JET is not restricted to Web-based computation. The use of some existing
parallel libraries and computer resources is also possible. The basic idea is
to allow existing clusters of machines running PVM or MPI to inter-operate with
JET computations.

To achieve this we have used two Java bindings developed in our research group
for the Windows versions of the MPI (WMPI) [WMPI] and PVM (WPVM) [Alves95]
libraries. The big master process of the PVM/MPI cluster only needs to create
an instance of a class that implements a bridge between the cluster and the JET
Master. The jobs are fetched by this object and placed in an internal buffer of
the PVM/MPI big master, which is responsible for distributing them among the
workers of the cluster. The results are gathered by the big master of the
cluster and passed to the bridge object to be sent to the JET Master.


3.  Performance  Results 

In this section, some performance results of JET are presented. These
measurements were taken in a heterogeneous environment of NT and Solaris
workstations. The workers were running on 6 PentiumPro-based machines, all of
them running at 200 MHz, with the NT Workstation operating system. Two of those
machines are dual-processor; hence, overall the performance results were taken
with 8 processors. The Master process was running on a Sun Ultra-Sparc machine
running Solaris V4.0. The machines were connected through a non-dedicated
10 Mbit/sec Ethernet network. The Worker applets were executed in Netscape
Communicator 4.0; the Master process was executed with JDK 1.1.


3.1  Simple  Benchmarks 

The relative speedup of the NQUEENS application with 14, 15 and 16 queens is
presented in Figure 2. In this example, the speedup was calculated relative to
the parallel version of the algorithm running on one processor. The achieved
results are quite good: with 8 processors the speedup was 7.66, 7.36 and 7.24
with 14, 15 and 16 queens respectively. The speedup decreases as the number of
queens increases because of small differences in processor performance that
become more visible with larger jobs. Hence, the time that the JET machine has
to wait for the last job increases with the size of the jobs. Although the task
distribution of JET has an intrinsic load-balancing behavior, it cannot
tolerate these fine-grain differences.

Fig. 2. Relative speedup of NQUEENS (14, 15 and 16 queens).

The EP-NAS application, which is part of the NAS benchmark suite [Bailey93],
was also used as a benchmark. Due to the temporary unavailability of the
dual-Pentium machines, the results were taken on just four processors. The
speedup presented in Figure 3 was calculated relative to a serial Java version
of the program.

Fig. 3. Relative Speedup of EP-NAS.

Although the EP-NAS problem involves a significant amount of floating-point
calculation, the performance of Java, and therefore of JET, was not affected:
the speedup once again is quite good, 3.87 with 4 processors.

Although speedup results always depend on the characteristics of each
application, these results show that JET does not degrade the performance with
an increasing number of processors.
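
For reference, the reported speedups can be converted into parallel efficiency
(speedup divided by the number of processors). The Python sketch below uses
only the figures quoted in this section; the function names are ours.

```python
def speedup(t_reference, t_parallel):
    """Speedup: reference execution time divided by parallel execution time."""
    return t_reference / t_parallel


def efficiency(s, processors):
    """Parallel efficiency: fraction of the ideal (linear) speedup achieved."""
    return s / processors


# (reported speedup, number of processors), taken from the text of Section 3.1.
reported = {
    "NQUEENS 14": (7.66, 8),
    "NQUEENS 15": (7.36, 8),
    "NQUEENS 16": (7.24, 8),
    "EP-NAS": (3.87, 4),
}

efficiencies = {name: efficiency(s, p) for name, (s, p) in reported.items()}
for name, e in efficiencies.items():
    print(f"{name}: efficiency {e:.3f}")
```

All four reported figures correspond to an efficiency above 90%.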


Fig.  4.  Relative  speedup  of  TSP  (20  cities)  with  and  without  additional 
information. 


The next experiment was made with an application that has different
characteristics from the last two. In the Travelling Salesman Problem (TSP),
additional information is passed asynchronously to the Workers, a capability
provided by the JET library. Each Worker is informed when a shorter path (a new
minimum path) has been found by another Worker, every time a result with a new
minimum arrives at the Master. A version without this capability (in which each
worker only knows its own minimum) was also implemented. Figure 4 presents the
relative speedup achieved by the two versions when searching a 20-city map.

The application, due to its intrinsic characteristics, does not scale as well
as the previous examples. The version that does not use the JET library's
capability of passing additional information to the workers does not scale as
well as the other one.


3.2  Complex Applications

Besides these simple benchmark applications, a few more complex applications
were ported to JET: a program to find Mersenne Primes [Mersenne] and a crack
application for the RC5 (64-bit key) encryption algorithm [Rivest95].

The RC5 encryption attack is an example of an embarrassingly parallel
application. Each job is a set of keys to be tested by using each of them to
decrypt the message and checking whether it is the correct one. The result only
has to indicate whether the correct key was in the tested set and, if so, which
key it is. The key-space to be searched is enormous and a concerted worldwide
effort [Bovine] is under way to crack this code. The JET computation is a
candidate to join this effort and use Web-based computation to help find the
correct key.

Fig. 5. Speedup of the RC5 64-bit encryption attack application.


Figure 5 shows the speedup achieved by JET when computing this application. The
speedup was calculated relative to a serial Java version of the application.
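
The way a key search splits into independent jobs can be sketched in a few
lines of Python. The cipher below is NOT RC5: toy_encrypt is a trivial XOR
stand-in with a 16-bit key, chosen only so that the example runs quickly, and
all function names are hypothetical.

```python
def toy_encrypt(key, data):
    """Stand-in cipher (not RC5): XOR the data with the two key bytes, repeated."""
    key_bytes = key.to_bytes(2, "big")
    return bytes(b ^ key_bytes[i % 2] for i, b in enumerate(data))


def make_jobs(key_bits, keys_per_job):
    """Master side: cut the key space into half-open ranges [start, end)."""
    total = 1 << key_bits
    return [(start, min(start + keys_per_job, total))
            for start in range(0, total, keys_per_job)]


def run_job(job, ciphertext, known_plaintext):
    """Worker side: test every key of the job; report (found, key)."""
    start, end = job
    for key in range(start, end):
        # XOR is its own inverse, so encrypting again decrypts.
        if toy_encrypt(key, ciphertext) == known_plaintext:
            return True, key
    return False, None


secret = 0xBEEF
message = b"hello JET"
ciphertext = toy_encrypt(secret, message)

jobs = make_jobs(key_bits=16, keys_per_job=4096)
results = [run_job(job, ciphertext, message) for job in jobs]
found_key = next(key for hit, key in results if hit)
```

Each job needs only its key range as input and returns a tiny result, which is
why this kind of application tolerates slow networks well.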

The Mersenne Primes Search application was tested in two versions; the
difference between these versions is the order in which the numbers are
searched. The version that starts with the higher numbers has a better speedup
(Figure 6). This is due to the better task distribution achieved by JET. The
size of the jobs grows exponentially with the number to be searched. If the
biggest task is the last to be assigned, then all the other processes will
stall waiting for that task to end. However, if the largest task is the first
one, all the other processes will be working (on other tasks) in the meantime.
At the end of the computation, the tasks are so small that the time to wait for
the end of the last job is very small.

Fig. 6. Speedup of the Mersenne Primes search application.

Figure 6 presents the speedup of the Mersenne Primes search, relative to a
serial Java version of the application. As can be seen, the version with
decreasing task size scales better, which shows the importance of a correct
task distribution.
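
The effect of task ordering can be reproduced with a small simulation. The
Python sketch below assumes identical workers and a master that hands each
task, in the given order, to the first worker that becomes idle; the task
sizes (powers of two) are illustrative and not taken from the experiments.

```python
import heapq


def makespan(task_sizes, n_workers):
    """Finish time when each task, in order, goes to the first idle worker."""
    free_at = [0] * n_workers      # time at which each worker becomes idle
    heapq.heapify(free_at)
    for size in task_sizes:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + size)
    return max(free_at)


sizes = [2 ** k for k in range(10)]          # 1, 2, 4, ..., 512
increasing = makespan(sizes, 4)              # biggest task assigned last
decreasing = makespan(sorted(sizes, reverse=True), 4)  # biggest task first

print("increasing order:", increasing)
print("decreasing order:", decreasing)
```

With the largest task first, the total time equals the size of that task,
because the remaining workers absorb all the small tasks in the meantime; with
the largest task last, it starts late and the other workers sit idle while it
completes.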


4.  Related  Work 

In the past years several projects have confirmed the suitability of the
Internet for massively parallel computing. [Silverman91] presented an example
of massively distributed computing over the Internet. It used 400 machines that
were located at research institutes on three different continents. The problem
was the factorization of a 100-bit integer used by the RSA cryptographic
algorithm. Each site received by electronic mail a set of polynomials to work
with independently. It took 275 MIP-Years to perform one of the computations.
The project has been active [RSAFact] since then, and the factoring of a
130-bit number was successfully completed in November 1996. For this problem a
collection of CGI scripts was used to automate and coordinate the flow of tasks
within the distributed network of Web sieving clients.

Another representative example is the winner of the 1992 Gordon Bell Prize: a
collection of 192 heterogeneous machines scattered around the United States was
used to solve a simulation of polymer chains [Karp93]. The outstanding
price/performance ratio achieved earned the project the prize. [Nieplocha96]
presents another project, which used four supercomputers located in
geographically dispersed computing centers of the United States, connected
together to compute a molecular simulation program. The speedups achieved were
quite good. Another interesting example was presented in [Strumpen93]. This
paper describes the use of 800 workstations to solve a problem that involved
molecular sequence analysis. The machines were dispersed through 31 different
local area networks and 5 continents.

¹ One MIP-Year is the amount of work performed by a 1-MIP machine running for
one year.

More recently, there have been other remarkable examples of Internet parallel
computing. For instance, in February 1997 a team of researchers using 3500
computers spread across Europe was able to crack an RSA code of 48 bits in less
than two weeks [Lash97].

On January 27th, 1998 a 19-year-old Californian student found the 37th Mersenne
Prime (the world's largest known prime) on behalf of the GIMPS project (Great
Internet Mersenne Prime Search) [GIMPS]. The computation involved about 4000
users who volunteered their machines, and the lucky man was Roland Clarkson,
who contributed his 200 MHz Pentium computer for 46 days, part-time, to prove
the number prime.

Finally, on October 19th, 1997, the result of one of the largest
distributed-computing efforts ever seen, involving tens of thousands of
computers connected to the Internet, was announced: the Bovine cooperative
effort [Bovine] decrypted a message encoded with RSA Labs' 56-bit RC5
encryption algorithm. The search took 250 days of massive Internet computing;
the average computational power was equivalent to 14,685 Intel Pentium Pro 200
processors. This time the lucky man who found the right key was Peter Stuer
from Belgium.

All these examples demonstrate that it is feasible to use worldwide-distributed
computing resources to perform large computations.

In recent years, the exploitation of geographically distributed machines for
parallel computing has become a clear trend. A considerable number of projects
have been proposed: Globe [Steen95], Legion [Grimmshaw96], Globus [Foster96],
Atlas [Baldeschweiler96], ParaWeb [Brecht96], Popcorn [Camiel96], Charlotte
[Baratloo96], DAMPP [Vanhelsuwe97], IceT [Gray97], Javelin [Cappelo97],
JavaParty [Philippsen97] and Albatross [Bal97], among others.

Some of these projects were also developed in Java: Javelin, Popcorn, DAMPP,
Charlotte, JavaParty, Atlas, ParaWeb, IceT and Albatross. Most of these systems
lack some of the following: support for fault-tolerance, scalability, support
for interoperability with other existing tools, and a statistics module to
motivate Internet users to participate in Web-based computations.

Although there are some differences between these projects and JET, all of them
try to prove the idea that Java can be used for parallel computing over the
Internet. It would be interesting if some standard protocols could be developed
to allow the cooperative execution of JET and any of these Java-based parallel
tools. In this way the number of machines working on the same global
computation could be extended.


5.  Final  Conclusions 

JET can be a massively parallel machine. It may comprise several hundreds of
machines connected to the Internet. Each machine that takes part in a JET
computation can be any commonplace machine: it just requires a Java-enabled
browser. The user can volunteer his spare CPU cycles just by clicking on a URL
in a Web page. A Java applet is downloaded to that machine and executes some
independent tasks of a number-crunching application. JET is a really
inexpensive parallel computing platform: it is based on the idea of
"scavenging" the idle CPU cycles of machines that are connected to the
Internet, reusing the existing computing facilities.

Some  built-in  features  provide  support  for  fault-tolerance  on  the  JET  computation, 
interoperability  with  PVM  and  MPI  libraries  and  the  usage  of  statistics  to  keep  the 
motivation  and  enthusiasm  of  the  user  volunteers. 

The first performance results of JET with simple benchmarks were very
promising. When complex applications were ported to JET, the results achieved
confirmed the ability of JET to be used for massively parallel computing over
the Internet.


Abstract. Applying fast scientific computing algorithms to large problems
presents a difficult engineering problem. We describe a novel architecture for
addressing this problem that uses a robust client-server model for interactive
large-scale linear algebra computation.

We discuss competing approaches and demonstrate the relative strengths of our
approach. By way of example, we describe MITMatlab, a powerful transparent
client interface to the linear algebra server. With MITMatlab, it is now
straightforward to implement full-blown algorithms intended to work on very
large problems while still using the powerful interactive and visualization
tools that Matlab provides. We also examine the efficiency of our model by
timing selected operations and comparing them to commonly used approaches.


1  Introduction 

We  describe  a  novel  architecture  for  a  “linear  algebra  server”  that  operates  on 
very  large  matrices.  Matrices  are  created  by  the  server  and  distributed  across 
many  machines  or  processors.  Operations  take  place  automatically  in  parallel. 
The  server  includes  a  general  communication  interface  to  clients  and  is  extensible 
via  a  robust  package  system. 

We  are  motivated  by  three  observations.  First,  many  widely-used  algorithms 
in  machine  learning,  differential  equations,  simulation,  etc.  can  be  realized  as 
operations  on  matrices.  Second,  it  is  vital  to  be  able  to  test  new  ideas  quickly  in 
an  interactive  setting.  Finally,  algorithms  that  appear  promising  on  small  data 
sets  can  fail  on  large  problems  and  it  would  be  helpful  to  have  a  tool  that  easily 
enables  experimentation  on  large  problems. 

Common approaches suffer from several difficulties. Interactive prototyping
environments such as Mathematica, Maple, Octave, and Matlab exist; however,
they often fail to work well on large problems. Linear algebra libraries
designed to work on large problems abound; however, they involve steep learning
curves. Further, they are typically not interactive, requiring that
applications be written in a compiled language, such as C++ or Fortran. This is
a burden for users who simply want a library's functionality and for
programmers who wish to extend it.

We address these problems directly. Like standard libraries, our system
encapsulates basic functionality; however, by modeling the system as a server,
we allow for on-the-fly interaction with arbitrary user interfaces. Further,
the server is a self-contained application, so we are able to extend it at
run-time.

In this paper, we show that our model opens several possibilities. We briefly
describe standard approaches in Section 2 before describing the Parallel
Problems Server itself in Section 3. We detail its architecture, focusing on
its extensibility. Section 4 describes MITMatlab, a system that enables users
to compute interactively with very large data sets directly from within Matlab.
We then report on the results of some performance experiments in Section 5.
Finally, we conclude, discussing further extensions to the system.

2  Standard  Approaches 

2.1  Linear  Algebra  Libraries 

For many compute-intensive tasks, the best way to maximize performance is
to use a library. For example, optimized versions of LAPACK [1] exist that
outperform similar code written in a high-level programming language (thanks
primarily to native implementations of the BLAS). For distributed memory
architectures, vendor-optimized libraries (e.g. Sun's SSL and IBM's ESSL)
coexist with public domain offerings such as ScaLAPACK [5], PARPACK [11] and
Petsc [4][9].

Each of these libraries has its own idiosyncratic interface and assumptions
about the types and distributions of data allowed. It is often a major
programming effort to incorporate library routines into an application.

2.2  Interactive  Systems 

The power of prototyping systems like Maple, Matlab, Mathematica and Octave
is that they are interactive. It is straightforward for both seasoned
programmers and relatively naive users to develop algorithms and to visualize
results from such algorithms. Unfortunately, while these tools work well for
small problems, they are often inadequate for production-level data.

There have been many attempts to extend prototyping tools in order to make
them work in parallel with large data sets. Here, we focus on systems that add
parallel features to Matlab, a widely-used scientific computing tool.

Both MultiMatlab from Cornell University [13] and the Parallel Toolbox for
Matlab from Wake Forest University [10] make it possible to manage Matlab
processes on different machines. Matlab is extended to include send, receive
and collective operations so that separate Matlab processes can communicate. In
short, these approaches implement traditional message passing with Matlab as
the implementation language.

fashion and managed among worker processes, which may live on different
machines. Currently we support row and column distributed dense arrays, column
distributed sparse arrays, and replicated arrays in single precision.
Communication and synchronization among the workers is accomplished using the
MPI [8] message passing library. This is a standard library available on a wide
range of platforms; it is currently the most portable way to develop
applications on distributed memory computers.


Machine! 


Machine2 


Machincn 


MPI  Layer 


rTET 


Server  Workers 


Workers 


Fig.  1.  The  General  Oi'ganization  of  the  Parallel  Problems  Server.  The  server 
process  provides  an  interface  to  any  client  that  implements  its  communication  protocol. 


3.2  Communication  and  Extensibility 

We  use  the  client-server  model  in  two  ways.  First,  there  is  a  protocol  for  commu¬ 
nicating  with  clients,  .lust  as  importantly,  there  is  a  separate  plug-in  architecture 
that  allows  for  straightforward  run-time  extensibility  of  the  PPServer. 


The  Client  Interface  While  we  believe  that  servers  are  crucial,  they  remain 
only  academic  oddities  without  useful  clients.  HTTP  servers  are  useful  but  they 
are  much  more  useful  when  powerful  browsers  exist.  Therefore,  it  is  important 
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Compilers  for  Mat-lab  are  also  an  active  area.  Both  the  CONLAB  system  from 
the  University  of  Umea  [7]  and  the  FALCON  environment  from  the  University 
of  Illinois  at  Urbana-Champaign  [3][12]  translate  Matlab-like  languages  into  in¬ 
termediate  languages  for  which  high  performance  compilers  exist.  For  example, 
FALCON  compiles  Matlab  to  Fortran  90  and  pC+H-.  Sophisticated  analyses  of 
the  Matlab  source  are  performed  so  that  efficient  target  code  is  generated. 

Both  of  these  approaches  have  merits;  however,  it  is  our  claim  that  they 
do  not  adequately  address  the  issues  we  have  raised.  The  former  approach  is 
too  involved  for  the  naive  user  and  the  latter  approach  sacrifices  direct  interac¬ 
tion  with  the  computation  and  includes  an  edit-compile-run  cycle  that  increases 
development  time. 

3  The  Parallel  Problems  Server 

The  Parallel  Problems  Server  (PPServer)  combines  many  aspects  of  the  ap¬ 
proaches  we  have  described  so  far.  Like  standard  linear  algebra  packages,  the 
PPServer  neatly  encapsulates  basic  functionality;  however,  because  it  is  a  server 
with  a  general  communication  protocol,  interaction  with  arbitrary  programs 
(with  their  own  u.ser  interfaces)  is  possible.  Also,  the  server  implements  a  ro¬ 
bust  protocol  for  accessing  compiled  libraries.  Thus,  extending  the  functionality 
of  the  PPServer  is  a  simple,  modular  task. 

3.1  The  Client-Server  Model 

The  client-server  model  is  ubiquitous.  There  are  HTTP  servers  that  allow  access 
to  data  via  the  World  Wide  Web  and  database  servers  that  admit  access  to 
specially  indexed  data.  Because  these  servers  implement  robust  protocols  for 
communicating  the  information  they  provide,  it  is  possible  to  build  useful  clients, 
such  as  web  browsers. 

We  believe  that  this  model  is  also  a  useful  one  for  scientific  computation. 
First,  there  is  no  need  to  force  a  client  to  operate  in  parallel  by  endowing  it 
with  communication  primitives;  rather,  such  communication  remains  implicit. 
As  a  result,  the  user  is  not  responsible  for  managing  data  among  various  pro¬ 
cesses.  The  user  simply  issues  the  client's  standard  commands;  these  are  then 
transparently  executed  on  multiple  machines. 

Secondly,  there  is  no  need  to  use  the  client  as  the  computational  engine.  While 
this  has  the  possible  short-term  disadvantage  of  the  server's  functionality  being 
different  than  the  client's,  we  gain  extremely  high  performance.  W'e  are  free  to  use 
the  fastest  distributed  memory  implementations  of  the  algorithms  that  we  need. 
Furthermore,  we  are  not  required  to  use  the  client’s  data  representation.  For 
example,  Matlab  uses  double  precision  numbers.  For  the  very  large  operations 
that  concern  us,  it  often  preferable  to  use  single  precision,  gaining  significant 
time  and  space  advantages  when  accuracy  is  not  a  concern. 

A  high-level  view  of  our  implementation  of  the  PPServer  is  shown  in  Fig¬ 
ure  1,  Clients  make  requests  of  the  server.  Data  are  created  in  a  distributed 
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that  the  client  interface  be  simple  to  use  but  powerful  enough  to  allow'  for  arbi¬ 
trary  operations. 

The PPServer uses standard Unix sockets for client communication. The
protocol is straightforward. A client sends a request, consisting of a command and
arguments. A command is a string naming a function. Functions may request
data or the loading, saving, or creating of data. Furthermore, they may require
that specific operations be performed on already existing data or that library
extensions be included with the server. Arguments are lists of characters,
integers and real numbers. Once a command has been completed, it is acknowledged
with a message from the server that includes any errors and returned values.

A C++ library (and source) is provided that implements this protocol,
including automatic conversion between standard C/C++-style data types and a
form suitable for transmission to/from the server. Clients need only provide a
suitable wrapper for these functions.
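
The shape of a request and its acknowledgement can be sketched as follows. This is only an illustration of the command-plus-arguments idea described above: the names Arg, Request, Reply and encode, and the one-line text encoding, are assumptions made for readability and are not the PPServer's actual client library or wire format.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical argument: a tagged value holding a character, an integer or a real.
struct Arg {
    char tag;   // 'c', 'i' or 'r'
    char c;
    long i;
    double r;
};

inline Arg charArg(char v)   { Arg a = {'c', v, 0, 0.0}; return a; }
inline Arg intArg(long v)    { Arg a = {'i', 0, v, 0.0}; return a; }
inline Arg realArg(double v) { Arg a = {'r', 0, 0, v};   return a; }

// A request is a command (a string naming a function) plus its arguments.
struct Request {
    std::string command;
    std::vector<Arg> args;
};

// The acknowledgement carries an error code and any returned values.
struct Reply {
    int error;                // 0 means "no error" in this sketch
    std::vector<Arg> values;
};

// Encode a request as "command tag:value tag:value ...".
// The text format is an assumption, chosen only so the result is easy to read.
std::string encode(const Request &req) {
    std::ostringstream out;
    out << req.command;
    for (std::size_t k = 0; k < req.args.size(); ++k) {
        const Arg &a = req.args[k];
        out << ' ' << a.tag << ':';
        if (a.tag == 'c')      out << a.c;
        else if (a.tag == 'i') out << a.i;
        else                   out << a.r;
    }
    return out.str();
}
```

A client wrapper would fill a Request, send the encoded form over its socket, and decode the server's answer into a Reply before handing values back to the host environment.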

The Server Interface. The PPServer is extensible (see Figure 2). It includes a
robust function interface using C++ objects. New functions are defined using this
interface. These new functions are compiled into dynamically loadable libraries,
dubbed “packages”, and loaded on demand. Each package is its own name space,
so new functions can be loaded “on top” of others, hiding functions of the same
name in other packages. Like the PPServer itself, package functions use MPI.
These functions enjoy access to the basic functionality of the Server, including
direct access to data and the ability to execute all the same commands that are
available to clients, including those in other packages.
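
The “on top” hiding rule can be illustrated with a small name-resolution sketch. The types Package and PackageStack and the two sample functions are hypothetical; the real server loads packages from dynamic libraries and its functions take argument lists, both of which are left out here.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a server function.
typedef int (*PackageFunction)();

inline int baseVersion()  { return 1; }
inline int newerVersion() { return 2; }

// One package: its own name space mapping function names to functions.
struct Package {
    std::string name;
    std::map<std::string, PackageFunction> functions;
};

class PackageStack {
public:
    // Load a package on top of those already loaded.
    void load(const Package &pkg) { packages_.push_back(pkg); }

    // Resolve a name by searching the most recently loaded package first,
    // so a newer definition hides an older one of the same name.
    PackageFunction lookup(const std::string &fn) const {
        for (std::size_t k = packages_.size(); k > 0; --k) {
            const std::map<std::string, PackageFunction> &fns =
                packages_[k - 1].functions;
            std::map<std::string, PackageFunction>::const_iterator it = fns.find(fn);
            if (it != fns.end()) return it->second;
        }
        return 0;  // not defined in any loaded package
    }

private:
    std::vector<Package> packages_;
};
```

Searching from the top of the stack downward is one simple way to obtain the hiding behaviour; whether the PPServer resolves names in exactly this order is not stated in the text.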

Figure 3 shows the code for a sample package. It contains one function, sumall,
that sums the elements of a distributed matrix. This example shows the
mechanisms for extracting input arguments, accessing the elements of the matrix, and
returning results to the client. With only a handful of exceptions, all current
server functionality is written in this way.

We have used the PPServer as the core of several applications, implementing
packages that provide access to ARPACK, ScaLAPACK and S3L, Sun's
optimized version of ScaLAPACK. The functions in the packages are merely short
wrappers for the underlying functions provided by the libraries.

Portability. The use of standard C++ and MPI has allowed us to develop a
system that is highly portable. Although the PPServer was originally developed
on a network of multiprocessors from Sun Microsystems, we have
been able to port it to a cluster of SMPs from Digital Equipment Corporation
with minimal effort. We are currently working on a port to Pentium-driven Linux
systems.


[Figure: diagram with the labels Libraries (ScaLAPACK, S3L), Computational &
Interface Routines, and Packages.]

Fig. 2. Extending the PPServer. A client communicates with the PPServer using
a simple command-argument protocol. The Server itself uses a “package” mechanism
to implement all but its most basic functions. New functionality can be added to the
PPServer and managed in a reasonable way. (S3L is Sun's optimized version of some
ScaLAPACK routines.)

void sumall(PPServer &theServer, PPArgList &inArgs, PPArgList &outArgs)
{
    // Get the matrix identifier that was passed in
    PPMatrixID srcID = *(inArgs[0]);

    // Make sure that we're passing in a dense matrix
    if (!theServer.isDense(srcID)) {
        // Return the corresponding error
        outArgs.addError(BADINPUTARGS, "Expecting a Dense Matrix");
        outArgs.add(0);
        return;
    }

    // Get a pointer to the actual matrix
    PPDenseMatrix *src = (PPDenseMatrix *) theServer.getData(srcID);
    float sum = 0, answer;

    // Find the local sum of all of the elements
    for (int i = 0; i < src->numRows(); i++)
        for (int j = 0; j < src->numCols(); j++)
            sum += src->get(i, j);

    // Add the local sums to find the global sum
    MPI_Allreduce(&sum, &answer, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    // Return an error code
    outArgs.addNoError();

    // Return the result to the client
    outArgs.add(answer);
}

// Register this function to the server
extern "C" PPError ppinitialize(PPServer &theServer);

PPError ppinitialize(PPServer &theServer)
{
    theServer.addPPFunction("sumall", sumall);
    return(NOERR);
}

Fig. 3. A Sample Server Extension. This code is essentially complete other than a few
header files.

3.3 Other Client-Server Models

There have been previous library systems that implement a similar model. Both
RCS [2] and Netsolve [6] act as fast back-ends for slower clients. In their model,
clients issue requests, arguments are communicated to the remote machine and
results sent back. Clients have been developed for Netsolve using both Matlab
and Java.

Our approach to this problem is different in many respects. Our clients are
not responsible for storing the data to be computed on. Generally, data is created
and stored on the server itself: clients receive only a “handle” to this data (see
Figure 4 for an example). This means that there is no cost for sending and
receiving large datasets to and from the computational server. Further, this
approach allows computation on data sets too large for the client itself to even
store.
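
A toy illustration of the handle idea follows. ToyServer and Handle are invented names, the "server" is a single in-process object rather than a set of distributed workers, and the operations are chosen only to show that the client holds an identifier while the matrix stays where it was created.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

typedef int Handle;  // hypothetical: all the client ever stores

class ToyServer {
public:
    ToyServer() : next_(1) {}

    // Create a rows x cols matrix of ones on the server side;
    // only the handle travels back to the client.
    Handle ones(std::size_t rows, std::size_t cols) {
        Handle h = next_++;
        store_[h] = std::vector<float>(rows * cols, 1.0f);
        return h;
    }

    // A coarse grained operation: the sum is computed where the data lives
    // and a single number is returned. An unknown handle yields 0.
    float sumall(Handle h) const {
        std::map<Handle, std::vector<float> >::const_iterator it = store_.find(h);
        float s = 0.0f;
        if (it != store_.end())
            for (std::size_t k = 0; k < it->second.size(); ++k)
                s += it->second[k];
        return s;
    }

private:
    Handle next_;
    std::map<Handle, std::vector<float> > store_;
};
```

In this arrangement the cost of a client call depends on the size of the arguments and results exchanged, not on the size of the matrix behind the handle.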

We also support transparent access to server data from clients. As we shall
see below, given a sufficiently powerful client, PPServer variables can be created
remotely but still be treated like local variables.

Both Netsolve and RCS assume that the routines that perform needed
computation have already been written. Through our package system we support
on-the-fly creation of parallel functions. Thus, the server is a meeting place for
both data and algorithms.


[Figure: Matlab 5 as the client of the Server, with Workers on Machine 1
through Machine n.]

Fig. 4. MITMatlab. Use of the PPServer by Matlab is almost completely
transparent. PPServer variables remain tied to the server itself while Matlab receives
handles to the data. Using Matlab scripts and Matlab's object and typing
mechanisms, functions using PPServer variables invoke PPServer commands implicitly.




VECPAR  '98  -  3rd  International  Meeting  on  Vector  and  Parallel  Processing 


4  MITMatlab 

Using the client interface, we have implemented a Matlab front end, called
MITMatlab. At present, we can process gigabyte-sized sparse and dense matrices
“within” Matlab, admitting many of Matlab's operations transparently (see
Figure 5). By using a client as the user interface, we take advantage of whatever
interactive mechanisms are available to it. In Matlab's case, we inherit a host
of parsing capabilities, a scripting language and a host of powerful visualization
tools.

For example, we have implemented BRAZIL, a text retrieval system for large
databases. BRAZIL can process queries on a million documents comprising
hundreds of thousands of different words. Because of Matlab's scripting
capabilities, little functionality had to be added to the server directly; rather, most of
BRAZIL was “written” in Matlab.


>> a=randn(512,512*p); a2=ones(512*p,512);
>> m=sprand(10000,1000*p,0.01);
>> whos
Your variables are:

Name    Size           Bytes      Class
a       512x512p       1048576    ddense array
a2      512px512       1048576    ddense array
m       10000x1000p    810176     dsparse array

Grand total is [illegible] elements using 2907328 bytes

>> e=eig(a);plot(e,'*');axis([-30 30 -30 30]);axis('square')
>> [u,s,v]=svds(m,5)
ans =
    7.7153    7.7342    7.7447    7.7831   16.8842
>> id=eye(1000*p); x=cumsum(id,1); y=cumsum(x,1);
>> imagesc(y+y')


Fig. 5. A Screen Dump of a Partial MITMatlab Session. Large matrices are
created on the PPServer through special constructors. Multiplication and other matrix
operations proceed normally.


5  Performance 

In  this  section  we  present  results  demonstrating  the  performance  of  the  PPServer. 
We  begin  with  experiments  comparing  the  efficiency  of  individual  operations  in 
Matlab  with  the  same  operations  using  MITMatlab.  We  conclude  with  a  case 
study  of  a  computation  that  requires  more  than  a  single  individual  operation.  We 
compare the performance impact of implementing a short program in Matlab,
directly on the PPServer, and using optimized Fortran.

5.1 Individual Operations

We  categorize  individual  operations  into  two  broad  classes,  according  to  the 
amount  of  computation  that  is  performed  relative  to  the  overhead  involved  in 
communicating  with  the  PPServer.  For  fine  grained  operations,  most  of  the  time 
is  spent  communicating  with  the  server.  A  typical  fine  grained  task  would  involve 
accessing  or  setting  an  individual  element  of  a  matrix.  Coarse  grained  operations 
include  functions  such  as  matrix  multiplication,  singular  value  decompositions, 
and  eigenvalue  computations  where  the  majority  of  the  time  is  spent  computing 
instead  of  communicating  input  and  output  arguments  with  the  server. 

Below we assess MITMatlab's performance on both kinds of operations.
Experiments were performed on a network of Digital AlphaServer 4/4100s
connected with Memory Channel.

Fine Grained Operations. These operations are understandably slow. For
example, in order to access an individual element, the client sends a message
to the server specifying the matrix and location, the server locates the desired
element among its worker processes, and then finally sends the result back to
the client.

MITMatlab cannot compete with the local function calls that Matlab uses
for these operations. For example, accessing an element in Matlab takes only
139 microseconds on average, while the same access requested from the server
can take 2.8 milliseconds. This result can be entirely explained by the overhead
involved in communicating with the server; a simple “ping” operation, where
MITMatlab asks the PPServer for nothing more than an empty reply, takes
2 milliseconds.
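
As a rough check on these figures, the ratios can be worked out directly from the three timings quoted above. The symbols t_Matlab, t_server and t_ping are our own labels for the 139 microsecond local access, the 2.8 millisecond remote access and the 2 millisecond ping; nothing else is assumed.

```latex
\[
\frac{t_{\text{server}}}{t_{\text{Matlab}}}
  = \frac{2.8\ \text{ms}}{0.139\ \text{ms}} \approx 20,
\qquad
\frac{t_{\text{ping}}}{t_{\text{server}}}
  = \frac{2\ \text{ms}}{2.8\ \text{ms}} \approx 0.71 .
\]
```

In other words, a remote element access is about twenty times slower than a local one, and the empty round trip alone covers roughly 70% of that time.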


Coarse Grained Operations. For coarse grained operations, the overhead of
client/server communication is only a small fraction of the computation to be
performed.

Table 1 shows the performance of dense matrix multiplication using Matlab
and MITMatlab. Large performance gains result from the parallelism obtained
by using the server; however, even in the case where the server is only using a
single processor, it gains significantly over Matlab. This is due in part to the
fact that the PPServer can use an optimized version of the BLAS. This
illustrates one of the advantages of our model: we can use the fastest operations
available on a given platform.

Using PARPACK, MITMatlab also shows superior performance in computing
singular value decompositions on sparse matrices (see Table 2).

It is worth noting that Matlab's operations were performed in double
precision while the PPServer's used single precision. While this clearly has an effect
on performance, we do not believe that it can account for the great performance
difference between the two systems.
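
To make the parallel gains concrete, speedup and efficiency can be computed from the 4K x 4K times that Table 1 reports for MITMatlab (357.9 s with p = 1, 175.6 s with p = 2, 94.7 s with p = 4 and 64.5 s with p = 3 + 3). The definitions S(p) = T(1)/T(p) and E(p) = S(p)/p are the usual ones and are our addition; the reading of which table entries belong to the 4K x 4K column is an assumption based on the table layout.

```latex
\[
S(2) = \frac{357.9}{175.6} \approx 2.04, \qquad
S(4) = \frac{357.9}{94.7} \approx 3.78, \qquad
S(6) = \frac{357.9}{64.5} \approx 5.55,
\]
\[
E(2) \approx 1.02, \qquad E(4) \approx 0.94, \qquad E(6) \approx 0.92 .
\]
```

Under that reading, the largest multiplication scales almost linearly, including the run split across two machines.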
Table 1. Matrix multiplication performance of MITMatlab on p processors. Times
are in seconds. Here “p = 3 + 3” means 6 processors divided between two machines.

                                      Matrix Size N
                          [illegible]   [illegible]   4Kx4K
Matlab                    [illegible]   [illegible]   [illegible]
MITMatlab with p = 1      [illegible]      45.1        357.9
               p = 2          2.8          21.5        175.6
               p = 4          3.9          12.9         94.7
               p = 3 + 3      1.4          14.4         64.5

Table 2. SVD performance of MITMatlab on p processors using PARPACK. These
tests found the first 5 singular triplets of a random 10K by 10K sparse matrix with
approximately 1, 2, and 4 million nonzero elements. Matlab failed to complete the
computation in a reasonable amount of time. Times are in seconds.

Processors                  Nonzeros
used           1M            2M            4M
2          [illegible]   [illegible]   [illegible]
4          [illegible]   [illegible]   [illegible]
3 + 3      [illegible]   [illegible]   [illegible]

Discussion. These results make it clear what types of tasks are best performed
on  the  server.  Computations  that  can  be  described  as  a  series  of  coarse  grained 
operations  on  large  matrices  fare  very  well.  By  contrast,  those  that  use  many  fine 
grained  operations  may  be  slower  than  Matlab.  Such  tasks  should  be  recoded  to 
use  coarse  grained  operations  if  possible,  or  incorporated  directly  into  the  server 
via  the  package  system.  Note  that  on  many  tasks  that  involve  computation  on 
large  matrices,  fine  grained  operations  occupy  a  very  small  amount  of  time  and 
so  the  advantages  that  we  gain  using  the  server  are  not  lost. 


5.2  Executing  Programs 

Figure 6 shows the Matlab function that we used for this experiment. It
performs a matrix-vector multiplication and a vector addition in a loop. Table 3
shows the results when the function is executed: 1) in Matlab, 2) in Matlab with
server operations, 3) directly on the server through a package, and 4) in Fortran.
Experiments were performed using a Sun E5000 with 8 processors. The Fortran
code used Sun's optimized version of LAPACK.

The native Fortran version is the fastest; however, the PPServer package
version did not incur a substantial performance penalty. The interpreted
MITMatlab version, while still faster than the pure Matlab version, was predictably
slower than the two compiled versions. It had to manage the temporary
variables that were created in the loop and incurred a little overhead for every server
function called. We believe that this small cost is well worth the advantages we
obtain in ease of implementation (a simple Matlab script) and interactivity.

A=rand(3000,3000);
x0=rand(3000,1);
Q=rand(3000,9);
n=10;

function X=testfun(A,x0,Q,n)
X(:,1)=x0;
for i=1:n-1
    X(:,i+1)=A*X(:,i)+Q(:,i);
end


Fig. 6. Matlab code for the program test. The Matlab version that used server
operations included some garbage collection primitives in the loop.
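
For readers who want to see the loop outside Matlab, the same recurrence can be written as a small serial routine. This is a plain single-process sketch with our own names (Vec, Mat, affineStep) and double precision storage; it is not the server package or the Fortran code that was timed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<double> Vec;
typedef std::vector<Vec> Mat;  // row-major: A[r][c]

// Compute y = A*x + q for a dense matrix A.
Vec affineStep(const Mat &A, const Vec &x, const Vec &q) {
    Vec y(q);
    for (std::size_t r = 0; r < A.size(); ++r)
        for (std::size_t c = 0; c < x.size(); ++c)
            y[r] += A[r][c] * x[c];
    return y;
}

// Mirrors testfun: column 0 is x0, and column i+1 = A * column i + Q(:,i).
// Qcols holds the columns of Q; the result is the list of n columns of X.
std::vector<Vec> testfun(const Mat &A, const Vec &x0,
                         const std::vector<Vec> &Qcols, std::size_t n) {
    std::vector<Vec> X;
    X.push_back(x0);
    for (std::size_t i = 0; i + 1 < n; ++i)
        X.push_back(affineStep(A, X[i], Qcols[i]));
    return X;
}
```

Each iteration is one matrix-vector product and one vector addition, so the work per step is dominated by the product, which is the coarse grained operation the server parallelizes.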


Table 3. The performance of the various implementations of the program test.
Although Matlab takes some advantage of multiple processors in the SMP we list it in
the p = 1 row.

Processors                     Time (sec)
Used          Fortran   Server    Matlab        Matlab
                        Package   with Server
1              3.07                              49.93
2              1.61      1.92      2.43
4              0.90      1.02      1.49
6              0.62      0.78      1.26
[illegible]    0.55      0.67      1.84


6  Conclusions 

Applying fast scientific computing algorithms to large everyday problems
represents a major engineering effort. We believe that a client-server architecture
provides a robust approach that makes this problem much more manageable.

We have shown that we can create tools that allow easy interactive access
to large matrices. With MITMatlab, researchers can use Matlab as more than a
prototyping engine restricted to toy problems. It is now possible to implement
full-blown algorithms intended to work on very large problems without sacrificing
interactive power. MITMatlab has been used successfully in a graduate course
in parallel scientific computing. Students have implemented algorithms from
areas including genetic algorithms and computer graphics. Packages encapsulating
various machine learning techniques, including gradient-based search methods,
have been incorporated as well.

Work on the PPServer continues. Naturally, we intend to incorporate more
standard libraries as packages. We also intend to implement out-of-core
algorithms for extremely large problems, as well as implement interfaces to other
clients, such as Java-enabled browsers. Finally, we wish to use the PPServer as
a real tool for understanding the role of interactivity in supercomputing.
Abstract. In order to address the diversity of existing parallel programming models, it is important to provide
development environments that can be incrementally extended with new services. Concerning the debugging of process-based
models, we have previously designed and implemented a basic interface that can be accessed by other tools as well as by
debugging modules associated with high-level programming languages.

In  this  paper  we  describe  our  work  towards  the  support  of  further  debugging  functionalities  for  parallel  and  distributed 
programs,  by  discussing  a  model  to  support  thread-based  debugging  services.  We  then  show  how  those  services  are 
supported  on  top  of  a  distributed  monitoring  and  control  software  architecture. 

1  Introduction 

In  order  to  ease  the  task  of  parallel  and  distributed  application  development,  a  debugging  service  must  support  the  following 
aspects: 

1. Inspection and control of the computation state;

2.  Tool  interfacing; 

3.  Heterogeneity. 

There  are  several  difficulties  regarding  the  development  of  debugging  services.  On  one  hand,  due  to  the  large  diversity 
of  programming  and  computational  models,  it  is  not  possible  to  define  a  universal  debugging  interface  that  can  meet  the 
requirements  of  all  such  models.  On  the  other  hand,  there  is  an  increasing  number  of  applications  which  are  composed  of 
multiple  separate  components,  each  based  on  its  own  computational  model,  be  it  sequential  or  parallel. 

So aspect (1) depends on each specific computational model, e.g. process-based, object-based, multi-threaded, as well
as the underlying programming paradigm, e.g. imperative or declarative. At a basic level, as far as parallel and distributed
debugging is concerned, the following entities should be modeled: processes, threads, and their interactions. Efforts such as
the one from the HPDF initiative [BFP97] are currently trying to establish a standard interface for the most common basic
debugging functionalities, which can hopefully improve the current situation.

Aspects (1) and (2) were addressed in our previous work [CL97, KCD+97, LCK+97], when we developed a
distributed process-level debugger (DDBG) for C/PVM programs. The DDBG debugger was integrated in a parallel software
engineering environment within the scope of a European project [S+94].

In both of the above situations, a debugging service must be able to handle the requirements of very distinct models, and
this can be achieved through heterogeneous debuggers (aspect (3) above).

We have recently implemented a process-level debugging interface on top of a very flexible monitoring and control
software architecture (DAMS) [CLV+98]. One important aspect of this architecture is that it can be easily extended with
new services and functionalities, such as for debugging, profiling, and distributed resource management. This allows an
incremental development of tools and their experimentation with rapid prototyping.

In  this  paper  we  extend  such  debugging  functionalities  with  a  thread-based  service,  and  show  how  it  is  implemented  on 
top  of  the  mentioned  architecture. 

The  organization  of  the  paper  is  as  follows.  Section  2  discusses  process  and  thread-based  debugging  services,  and  Sec.  3 
discusses  implementations  on  top  of  the  DAMS  architecture.  Then  we  discuss  related  work  and  present  some  conclusions. 

2  Process  and  Thread-oriented  Debugging  Services 

In  order  to  provide  debugging  functionalities  for  process-  and  thread-based  models,  we  must  identify  the  basic  concepts  and 
mechanisms  supporting  inspection  and  control  of  the  computation  state.  We  define  a  model  that  is  intended  to  be  neutral 
concerning  the  diversity  of  semantics  of  existing  process  and  thread-based  models. 
2.1 The Components of the Model

The  model  defines  the  following  basic  entities: 

-  Processes.  A  process  is  a  passive  entity,  a  kind  of  “capsule”  supporting  contexts  for  the  concurrent  execution  of  multiple 
threads.  A  context  is  defined  by  a  non-empty  set  of  cells.  A  process  is  completely  specified  by  four  types  of  “contexts”: 

•  Process  Memory  Context.  It  corresponds  to  the  process  address  space  which  is  represented  by  a  set  of  values  of 
accessible  memory  cells.  Code,  data  and  stack  regions  are  mapped  onto  such  memory  cells. 

• Process Synchronization Context. It contains cells representing synchronization variables such as locks and
mutexes, as well as condition variables. Of course they are also mapped onto memory cells but we prefer to separate
them for greater clarity of the model.

•  Process  Communication  Context.  It  is  represented  by  the  values  of  the  input/output  ports  and  the  communication 
channels  (such  as  message  queues).  Such  communication  channels  and  input/output  ports  can  also  be  modeled  by 
associated  memory  cells,  but  they  are  explicitly  identified  here,  because  they  describe  the  process  interaction  with 
its  outside  environment. 

•  Process  Execution  Context.  This  is  defined  by  the  set  of  threads  that  execute  within  the  scope  of  the  process.  Each 
such  thread  has  a  precise  logical  specification  in  terms  of  specific  contexts,  as  described  below.  Additionally,  each 
process  has  an  associated  Scheduling  Context  which  describes  the  status  of  the  physical  processor  scheduling  for 
all  threads  (this  is  not  further  detailed  in  this  paper). 

- Threads. A thread is an active entity which executes some code within the contexts defined by its enclosing process. It
is specified by two types of “contexts”:

•  Thread  Memory  Context.  It  is  defined  by  the  set  of  values  of  the  memory  cells  containing  the  code,  the  data,  and 
the  stack  regions  that  were  specified  for  each  thread.  Of  course,  both  the  code  and  data  regions  are  shared  by  all 
threads  in  a  single  process,  unlike  the  stack  regions  which  must  be  kept  private. 

•  Thread  Execution  Context.  This  is  defined  by  the  status  of  the  Virtual  Processor  that  is  associated  with  each  thread 
in  order  to  model  its  logical  behavior.  The  status  of  a  virtual  processor  is  defined  by  the  set  of  values  of  the 
processor  registers,  and  by  a  logical  state,  a  cell  containing  one  of  the  values  T_Running,  T_Blocked,  T_Stopped, 
T_Terminated. 
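
The entities listed above map naturally onto a few data structures. The sketch below is only one possible encoding, under our own assumptions: a context is represented as a map from cell names to integer values, the scheduling context is omitted, and the type and function names (Context, Thread, Process, countInState) are invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// A context is a set of named cells; integer values keep the sketch small.
typedef std::map<std::string, long> Context;

// Logical states of a thread's virtual processor, as named in the text.
enum ThreadState { T_Running, T_Blocked, T_Stopped, T_Terminated };

struct Thread {
    Context memory;      // Thread Memory Context (e.g. private stack cells)
    Context registers;   // register values of the virtual processor
    ThreadState state;   // logical state of the virtual processor
};

struct Process {
    Context memory;               // Process Memory Context
    Context synchronization;      // locks, mutexes, condition variables
    Context communication;        // ports and communication channels
    std::vector<Thread> threads;  // Process Execution Context
};

// Count the threads of a process that are in a given logical state.
std::size_t countInState(const Process &p, ThreadState s) {
    std::size_t n = 0;
    for (std::size_t k = 0; k < p.threads.size(); ++k)
        if (p.threads[k].state == s) ++n;
    return n;
}
```

An inspection command of a debugger would read such structures, while a controlling command would modify them, for instance by changing a thread's state cell.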

The thread logical state transition diagram presented in Fig. 1 identifies the possible state transitions allowed to a thread,
identifying at the same time some of the debugging functionalities that trigger each state transition. Associated with each
transition in the state diagram there is a set of labels naming the possible causes of the transition. Their names suggest the
associated functionality. Labels between angle brackets, such as <T_Exit>, define actions resulting from the thread execution
and generated internally or by the system. Other labels, such as T_step, identify transitions forced by an external agent,
such as the debugger.


Fig.  1.  The  thread  state  transition  diagram 
- T_DETACHED. The thread is running free and it is not under control of the debugger.

-  T_RUNNING.  The  thread  is  running  under  control  of  the  debugger. 

-  T_STOPPED.  The  thread  is  stopped  as  a  result  of  a  debugger  command  or  due  to  the  occurrence  of  some  exception. 

-  T_BLOCKED.  The  thread  has  invoked  a  blocking  call  and  is  temporarily  blocked  until  that  request  is  satisfied. 

-  T_TERMINATED.  The  thread  has  terminated  due  to  a  debugger  or  system  command,  or  because  it  has  reached  its  exit 
point. 


2.2  Events 

Using  such  a  model,  we  are  able  to  precisely  identify  the  events  which  are  relevant  to  describe  and  control  a  concurrent 
computation  with  multiple  processes  and  threads. 

In  the  following  we  briefly  illustrate  how  this  model  can  help  in  the  process  of  precisely  specifying  the  operational 
semantics  of  debugging  primitives  in  terms  of  events. 

Generally, given a specific context (as defined above), an event is defined by a modification of a single value
of a cell contained in that context. This corresponds to the basic notion that an event describes a transition from one state to
another state.


Process-level Events. These events describe all modifications made to any of the contexts defined in the process (Memory,
Synchronization,  Communication,  and  Execution).  For  example,  events  are  triggered  by  modification  of  global  process 
variables,  by  modifications  of  the  state  of  a  mutex,  by  the  arrival  of  a  message,  or  by  the  creation  or  destruction  of  a  thread 
in  a  process. 


Thread-level Events These events describe the modifications in the thread Memory and Execution contexts. Examples are
the modification of a local thread variable or of a physical processor register. Thread-level events are also triggered by any
change in the logical state of the thread's virtual processor.


2.3  Actions 

An action is responsible for the state modification that triggers each event. We identify two classes of actions:

- Internal Actions. They are enforced by the virtual processor associated with a given thread in a process. The sequence
of all pairs (Internal_action, Generated_Event) that are produced during thread execution precisely specifies the
computation path followed by the thread. Such internal actions may correspond to physical processor instructions or to
higher-level instructions, for example C code statements.

- External Actions. They are enforced by external controller entities such as the debugger, acting upon the contexts
defined within a process. The sequence of all pairs (External_action, Generated_Event) gives the history of a debugging
session.

2.4  The  Debugging  Activity 

Debugging functionalities fall into two categories: inspection commands and controlling commands. Independently of this
classification, they can refer to individual processes or threads, or to process interactions or thread interactions. The core
of the debugging activity amounts to observing and/or enforcing well-defined sequences of events so that deviations from the
program specification can be localized and corrected. Our model provides a foundation to develop a mechanism that controls
the detection and registering of events. Basically, event detection can be enabled for a well-defined class of action/event pairs.
For example:

-  Detect  events  in  a  given  process/thread,  associated  with  its  Memory  context,  which  were  generated  by  internal  actions 
only.  It  is  possible  to  detect  events  associated  with  a  given  memory  cell. 

-  Detect  events  in  a  given  process,  associated  with  its  Synchronization  context,  and  generated  by  internal  actions  of  a 
given  thread. 

- Detect events in a given process/thread, associated with the logical state of its Execution context, which were generated
by external actions.

In general it is possible to selectively enable/disable event detection for specific types through the specification of the
following elements:

-  Which  class  of  action  triggers  the  event  (External,  Internal). 

-  Which  entity  should  be  monitored  (Process,  Thread,  Context,  Cell). 
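
As an illustration of this selective detection, the following minimal sketch assumes that a detection request is a record whose unspecified fields act as wildcards. The names `ActionClass`, `Occurrence` and `DetectionFilter`, and the numeric identifiers, are hypothetical and are not taken from the debugging interface.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ActionClass(Enum):
    INTERNAL = "internal"
    EXTERNAL = "external"


@dataclass(frozen=True)
class Occurrence:
    """One detected event together with the class of action that caused it."""
    action: ActionClass
    process: int
    thread: Optional[int]        # None for process-level events
    context: str                 # "Memory", "Synchronization", ...
    cell: Optional[str] = None


@dataclass(frozen=True)
class DetectionFilter:
    """Fields left as None act as wildcards (an assumption of this sketch)."""
    action: Optional[ActionClass] = None
    process: Optional[int] = None
    thread: Optional[int] = None
    context: Optional[str] = None
    cell: Optional[str] = None

    def matches(self, occ: Occurrence) -> bool:
        wanted = (self.action, self.process, self.thread, self.context, self.cell)
        actual = (occ.action, occ.process, occ.thread, occ.context, occ.cell)
        return all(w is None or w == a for w, a in zip(wanted, actual))


# Synchronization context of process 7, internal actions of thread 3
# (the identifiers 7 and 3 and the cell name are made up for the example).
sync_filter = DetectionFilter(ActionClass.INTERNAL, process=7, thread=3,
                              context="Synchronization")
hit = Occurrence(ActionClass.INTERNAL, 7, 3, "Synchronization", "mutex_a")
miss = Occurrence(ActionClass.EXTERNAL, 7, 3, "Synchronization", "mutex_a")
```

Here `sync_filter.matches(hit)` is true and `sync_filter.matches(miss)` is false, since the second occurrence was caused by an external action.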

Well-known debugging primitives can easily be represented in terms of this model. For example, concerning threads, a
command “set_var()” of a local variable in a given thread would generate an event related to the Thread Memory Context.
A command “set_breakpoint()” in a given thread would relate to the Thread Execution Context. Concerning processes, a
command “kill_thread()” would relate to the Process Execution Context (and also to the Thread Execution Context because
it also changes the thread state).

By monitoring the occurrence of events of a certain type, it is possible to construct event histories that contribute to
a better understanding of the concurrent computation. For example, in order to implement a deterministic replay facility
concerning process interactions only (i.e. message exchange), one needs to enable the detection of events related to the
Process Communication Context. A replay facility for thread interactions internal to a single process depends on the enabling
of events related to Process Memory and Process Synchronization Contexts.

2.5  Asynchronous  Event  Notification 

Several  types  of  debugging  commands  provide  an  immediate  response,  e.g.  as  in  a  “set_var()”  or  “set_breakpoint()”,  which 
give  a  success  or  failure  indication,  and  possibly  return  some  result  (e.g.  a  breakpoint  identification). 

Other types of debugging commands typically act upon Thread Execution Contexts in such a way that it is not possible
to obtain immediate meaningful information, besides knowing that the command was successfully applied. For example,
commands like “continue()” or “next()” immediately originate a logical state transition in a thread (from T_STOPPED to
T_RUNNING), but it might take an unpredictable amount of time for the thread to reach a point that should be inspected
during debugging, e.g. to reach a breakpoint. In general, the debugger interface or the application that is invoking debugging
commands should not be forced to wait until the desired event is reached. Instead, an asynchronous event notification
mechanism must be provided by the debugging interface, allowing a thread to explicitly register its interest in receiving
event notifications through the declaration of an event handler.

This  declaration  is  achieved  by  calling  the  service 

T_sethandler  (process_thread_list,  event_type,  handler) 

which defines the function handler as a handler of events of the given type (according to the previous subsection)
that originate from any of the processes or threads in process_thread_list. Multiple threads in the same or
different processes can register handlers for a specific type of event. If such an event occurs, a notification is sent to all the
registered threads. When a thread receives an event, its current execution is suspended while the associated handler function
is executed.
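
The registration semantics can be sketched as follows. This is a simplified, synchronous illustration, assuming that processes and threads are identified by strings and that a handler receives the source and the event type; the real service is asynchronous and suspends the receiving thread, which the sketch does not model. `HandlerRegistry`, `set_handler` and `notify` are hypothetical names, not the signature of T_sethandler.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Iterable, List, Tuple

Handler = Callable[[str, str], None]   # called as handler(source, event_type)


class HandlerRegistry:
    """Hypothetical stand-in for the handler table behind T_sethandler."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Tuple[frozenset, Handler]]] = defaultdict(list)

    def set_handler(self, process_thread_list: Iterable[str],
                    event_type: str, handler: Handler) -> None:
        # Several handlers may be registered for the same event type.
        self._handlers[event_type].append((frozenset(process_thread_list), handler))

    def notify(self, source: str, event_type: str) -> int:
        """Call every handler registered for this source; return how many ran."""
        delivered = 0
        for sources, handler in self._handlers.get(event_type, []):
            if source in sources:
                handler(source, event_type)
                delivered += 1
        return delivered


registry = HandlerRegistry()
seen: List[str] = []
registry.set_handler(["p1.t1", "p1.t2"], "breakpoint",
                     lambda src, ev: seen.append("gui:" + src))
registry.set_handler(["p1.t1"], "breakpoint",
                     lambda src, ev: seen.append("viz:" + src))
count = registry.notify("p1.t1", "breakpoint")   # both registered handlers run
```

Both registrations receive the notification for `p1.t1`, which mirrors the statement that a notification is sent to all the registered threads.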

This event mechanism is also used to support tool synchronization and coordination in an integrated software development
environment where multiple tools (for debugging, testing, visualization, etc.) concurrently observe and control the
evolution of a computation. This coordination is achieved by having some tools, e.g. a graphical user interface or a
thread-based visualization tool, register handlers related to the occurrence of some types of events, which may originate from
internal and/or external actions (e.g. setting breakpoints). On event occurrence, such tools can react and update the graphical
view that is being presented to the user, consistently with the evolution of the computation and with the actions triggered by
the debugging tool.

2.6  Summary  on  the  Debugging  Functions 

In  this  section  we  have  discussed  how  an  event-based  model  can  provide  the  foundation  to  develop  process  and  thread  based 
debugging  services.  In  this  paper  we  have  not  presented  the  interface  of  process  and  thread  debugging  primitives.  Our  goal 
is  to  be  able  to  support  distinct  and  evolving  interface  primitives  so  that  our  debugging  framework  can  be  used  to  support 
experimentation  and  building  of  prototypes. 

3  A  Process-  and  Thread-oriented  Debugging  Tool 

In  this  section  we  discuss  implementation  issues,  including  the  support  for  multiple  connections  from  concurrent  client 
tools,  as  well  as  the  infrastructure  for  implementing  the  debugging  functionalities  that  we  have  outlined  in  the  previous 
section.

Fig. 2. The TDBG architecture


3.1  The  DAMS  system 

The DAMS (Distributed Application Monitoring System) system provides the basic layer to support the incremental
development of parallel and distributed monitoring and control services, such as debugging, profiling and resource management.
Its design and implementation are neutral regarding the programming and computational models of the target application.

The  processes  related  to  DAMS  can  be  classified  in  one  of  three  classes  (see  Fig.  2): 

-  Target  application  processes.  The  set  of  processes  that  will  be  controlled/monitored  by  the  DAMS  system. 

- Client application processes. The set of independent client tools that may operate concurrently over the Target
application processes by issuing requests to the DAMS system, through a service interface library.

- The DAMS processes. The set of internal processes that implement the DAMS system and its services. This set includes:

• System processes. This includes a single Service Manager and several Local Managers, one per physical node of the
target architecture. These processes manage the internal DAMS resources and provide an architecture-independent
communication layer that allows the Client application processes to control and inspect the evolution of the Target
application processes.

• Service processes. Each class of service (e.g. debugging, resource management, profiling) requires a DAMS
configuration which includes a set of specific components: one Service Module, to handle the Client application service
requests and their high-level system-independent parts; and a set of Driver processes, usually one per process of
the Target Application, to implement the low-level system-dependent control and monitoring aspects.

The most important aspects of DAMS are: its extensibility; its neutrality concerning the target application models; its
built-in support for multiple concurrent connections from client tools; and its functionalities for tool coordination and
synchronization using events.

3.2  The  TDBG  tool 

In [CLV+98] we have described the implementation of PDBG, a process-level debugger as a DAMS service. Here we describe
how thread-level debugging (the TDBG debugger) is implemented as a service on top of the DAMS system by the provision
of an adequate set of Service processes.

For a better understanding of how the TDBG components interact, we present an example, which also refers to Fig. 2.
There are three Target application processes; two Client applications: the Graphical Interface and the Text Interface; and,
for simplicity, the pictured DAMS configuration is providing the TDBG service only.

Let us consider that a client application (e.g. a Text Interface) issues a debugging command by calling a debugging
library function that sets a breakpoint in a given line of a given thread, e.g. set_break (t 12345, l 98). This function
establishes the communication with the Service Manager, which identifies the type of requested service (related to
debugging), and forwards it to the appropriate component: the Debugging Module.

The Debugging Module parses the received data, identifies the type of request, and then sends the (possibly) transformed
request to the Debugging Driver. The DAMS system internally manages the routing tables to ensure that the request reaches
the desired Debugging Driver, which is associated with the identified target process.

In order to allow easy plug-in of existing commercial or public-domain Node Debuggers, the Debugging Driver includes
a front-end process, called a Controller, which is responsible for all interactions with the actual Debugger. The Controller
acts as a kind of “user”, as far as the Debugger is concerned¹.

After parsing the data that was sent by the Debugging Module, the Controller identifies the target process, and issues
an adequate sequence of commands conforming to the existing Debugger interface, e.g. select_thread 12345,
break_line 98. The Controller waits for the completion of each command before issuing the next one. When the
sequence is terminated, the results of the command, e.g. local_brkpt_id=2, are sent back to the Debugging Module.

The  Debugging  Module  parses  the  received  data,  and  does  the  necessary  post-processing,  for  example  converting  a  local 
breakpoint  identifier  into  a  global  breakpoint  identifier,  e.g.  global_brkpt_id=14.  Afterwards,  it  sends  the  results  back 
to  the  Client  process  in  the  form  of  return  values  of  the  invoked  library  call. 
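
The post-processing step can be illustrated with a minimal sketch of a local-to-global breakpoint identifier table. It assumes that global identifiers are allocated sequentially and that each Debugging Driver is identified by a string; neither assumption comes from the TDBG implementation, and the class name `BreakpointTable` is hypothetical.

```python
import itertools
from typing import Dict, Tuple


class BreakpointTable:
    """Hypothetical mapping between per-driver and global breakpoint ids."""

    def __init__(self) -> None:
        self._next_id = itertools.count(1)                   # assumed allocation policy
        self._to_global: Dict[Tuple[str, int], int] = {}     # (driver, local id) -> global id
        self._to_local: Dict[int, Tuple[str, int]] = {}      # global id -> (driver, local id)

    def to_global(self, driver: str, local_id: int) -> int:
        key = (driver, local_id)
        if key not in self._to_global:
            global_id = next(self._next_id)
            self._to_global[key] = global_id
            self._to_local[global_id] = key
        return self._to_global[key]

    def to_local(self, global_id: int) -> Tuple[str, int]:
        return self._to_local[global_id]


table = BreakpointTable()
first = table.to_global("driver_a", 2)    # local id 2 reported by one driver
second = table.to_global("driver_b", 2)   # the same local id from another driver
```

Two drivers may report the same local identifier, so the table keeps them apart by giving each pair (driver, local id) its own global identifier; the reverse lookup lets a later request on a global identifier be routed back to the right driver.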

3.3 Summary on DAMS and TDBG

By describing how the interfacing between the client tools and the TDBG debugger is done, we have illustrated the great
flexibility of the DAMS architecture in supporting extended functionalities. Namely, it is possible to integrate multiple
heterogeneous target debuggers, for processes and threads, in a single DAMS configuration.

4  Related  Work 

There are many current efforts in the field of parallel and distributed debugging (with and without thread support) and
related topics [LWSB97,Zho94,MB94,Lum95,XWZS96,PHK91,DJ88,HS88]. Because we cannot cover them all here, we
have chosen two related approaches that are briefly presented and compared with our own approach. The first one concerns
the specification of debugging functionalities and the second concerns a distributed design supported by an existing tool.

4.1 The HPDF (proposed) Standard

The High-Performance Debugging Forum (HPDF) [BFT+97] is a collaborative effort between researchers and industry,
aiming to define a standard for parallel debuggers. As of Version 1 of the standard, a command-based (non-graphical) interface
has been prepared, specifying both the syntax and the semantics of the proposed services. The definition of graphical interfacing
and complex I/O operations is still in progress.

According to HPDF, a parallel debugger is either a thread-oriented debugger, a process-oriented debugger or a hybrid
debugger, and sets of required and recommended services have been defined for each of them. Our design can easily
accommodate most of the HPDF proposed functionalities for hybrid debuggers.

In this regard, the tool integration features of TDBG, which present a unified event-based model for the internal and
external actions, are a distinct contribution to the integration of parallel debuggers in more complete and complex program
development environments [KCD+97,LCK+97].

4.2  The  p2d2  Distributed  Debugger 

The p2d2 distributed debugger [Hoo96] is a process-oriented debugger. It uses a client-server approach, with a well-defined
interface, promoting portability by isolating the system-dependent code into a debugger server. There is a user interface
capable of displaying and controlling many processes, individually or associated in groups. The GNU gdb is used as a Node
Debugger, and a call-back method supports asynchronous interactions between gdb and the user interface.

The distinctive feature of our approach (i.e. TDBG+DAMS) is to support multiple concurrent client tools and to offer the
necessary mechanisms to implement client tool coordination.

¹ From an implementation point of view, the existing Node Debugger must provide an interface library to be accessed by the Controller
front-end.

5 Conclusions and Ongoing Work

In  this  paper  we  have  discussed  a  model  to  support  the  development  of  process  and  thread  debugging  functionalities,  and 
their  implementation  as  services  of  the  DAMS  distributed  architecture.  This  work  is  part  of  our  experimentation  towards  the 
incremental  building  of  tool  support  services  for  parallel  and  distributed  processing. 

There is a prototype of DAMS running on our Ethernet LAN with Linux PC nodes, and on a cluster of FDDI-interconnected
Alpha processors under OSF/1. A process-level debugger (PDBG) runs as a DAMS service, and uses the GNU gdb as the target
debugger. The efficiency of this prototype suffers because gdb is a heavyweight process.

This prototype is being extended to implement TDBG, which provides a thread-based debugging service according to
the description in Sec. 3.2. A different Node Debugger is used, namely SmartGDB [Hal92], which is a thread-oriented
debugger, extending GNU gdb with TCL scripting capabilities and debugging support for user-level threads.

An ongoing related project focuses on the integration of TDBG and a visualization tool for thread-based programs. In this
project we are experimenting with the TDBG tool integration and coordination support mechanisms.


Abstract. In vector processors, when several vector streams concurrently access the
memory system, references of different vectors can interfere in the access to the
memory modules, causing module conflicts. Besides, in a memory system where
several modules are mapped in every bus, delays due to bus conflicts are added to
module conflict delays. This paper proposes an access order to the vector elements that
avoids conflicts when the concurrent access corresponds to vectors of a subfamily, and
the request rate to the memory modules is less than or equal to the service rate. For
other cases of concurrent access, the proposal dramatically reduces conflicts.


1  Introduction 

In vector processors, the ideal execution of a memory vector instruction would deliver
one datum per cycle after an initial latency. Since the memory module reservation time
is, in general, much longer than the processor cycle time, the memory system
usually consists of multiple memory modules with independent access paths.

Usually, vector processors have more than one port to the memory subsystem to
allow several memory vector instructions to proceed concurrently. Under these
conditions, conflicts appear in the access to the memory modules when two or more
references are simultaneously issued to the same module. In addition, a reference to a busy
module also causes a memory module conflict.

In vector processors with several paths to the memory system, or in multi-vector
processors, another factor that affects the performance of the memory system is the
interconnection network between processors and memory modules. In the design of
some memory systems, the number of independent access paths to the memory modules
is reduced (several modules are mapped on every bus) [2][6], which lowers the
economic cost. However, this solution implies accepting
conflicts in the access to the interconnection network, in addition to the memory module
conflicts mentioned above. Both types of conflicts appear even in the especially common
case of the concurrent access of several one-strided vector streams. The main effect of the
conflicts is the starvation of the functional units, with the subsequent loss of
performance.

Memory vector instructions with regular access patterns generate periodical conflicts,
as this kind of instruction generates periodical streams of references (vector streams
with a constant stride). In the context of this paper, our interest is the reduction, and the
elimination when possible, of the memory conflicts (interconnection and memory
module conflicts) caused by concurrent constant-strided vector streams.

Several kinds of methods have been proposed to reduce the number of cycles lost due
to memory conflicts. Some authors propose to carefully place in memory the vectors to
be concurrently accessed [10][14][17]. This technique implies that the access patterns must be
known at compile time, and the access to a vector stream in a different context of a
program could decrease its effectiveness. Other authors propose the use of buffers in the
memory modules [17] or in the interconnection network [19]. Buffers allow the
requesting processor to keep sending requests without waiting, but this technique
requires labelling the memory references to allow their reordering before being used by
the processor; the cost of the interconnection network increases as the tag must be sent
along with the request [17]. In addition, buffers do not directly solve the problem of the
convergence to a single port of the requests in the return network [21].

Our proposal consists of a new access order to the vector stream elements. In parallel
with our work, other authors have studied this kind of solution [15]. This new order,
working with a new arbitration algorithm, helps concurrent vector streams perform
their memory requests with no conflicts, or with fewer conflicts than the classical
access implies.

One of the cases for which our proposal completely avoids conflicts is the very
common case of the concurrent access of several one-strided vector streams. J. Fu and
J.H. Patel in [7] show that between 7% and 54% of the vector streams in four programs
of the Perfect Club benchmark set [1] (ADM, ARC2D, BDNA and DYFESM) access the
memory with stride 1.

Section  2  outlines  the  architecture  model,  on  which  the  present  study  is  based,  and 
the  characterization  of  the  interleaving  mapping  and  vector  access  functions.  The 
interaction  between  vector  streams  in  a  complex  memory  system  is  studied  in  Section  3. 
Section  4  presents  the  proposed  access  order  to  the  memory  modules  and  presents  its 
hardware  support.  Finally,  Section  5  deals  with  the  comparison  between  the  proposal 
and  the  method  used  in  a  classical  system,  like  CRAY  X-MP. 

2  Architecture 

The  memory  architecture  presented  in  Fig.  1  is  an  example  of  the  complex  memory 
system,  similar  to  the  one  used  in  the  CRAY  X-MP  [2]. 

The memory subsystem consists of $M = 2^m$ memory modules (with a memory cycle of $n_c$
clock cycles, $n_c$ being a power of two), connected to $P = \lfloor M/n_c \rfloor$ memory ports through an interconnection
network. To reduce the number of access paths to the memory subsystem, the memory
modules are distributed into $SC$ sections. A memory module request occupies the
section path where the module is located during one cycle. It is supposed that $SC$ is a power of two,
and that the number of memory modules is a multiple of $SC$.

In each cycle, every port requests an element of a vector stream except when a
conflict appears in the interconnection network or in a memory module. In case of
conflict, only one vector stream obtains the access and the other requests must wait; a
priority rule must determine which port will be able to proceed. In the present paper, we
use the arbitration implemented in the CRAY X-MP [2] to measure the performance of
the classical access (Definition 5) and in the examples of concurrent access when
another algorithm is not specified. This arbitration gives priority to the vector stream
with the lower $2^s$ stride factor; for ports with the same parity of strides, the priority is fixed.

The memory is organized as an interleaved address mapping model (section = $A_i \bmod SC$,
memory module = $A_i \bmod M$, offset = $\lfloor A_i/M \rfloor$). The interleaving function which
maps the address into memory modules has a period of $P = M$.
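
The interleaved mapping can be written as a short sketch, assuming that the section is the address modulo the number of sections, the module is the address modulo the number of modules, and the offset is the integer quotient of the address by the number of modules. The function name `map_address` and its parameter names are chosen for this illustration only.

```python
from typing import Tuple


def map_address(address: int, num_modules: int, num_sections: int) -> Tuple[int, int, int]:
    """Return (section, module, offset) for a word address."""
    if num_modules % num_sections != 0:
        raise ValueError("the number of modules must be a multiple of the number of sections")
    section = address % num_sections
    module = address % num_modules
    offset = address // num_modules
    return section, module, offset


# 16 modules distributed into 4 sections, as in the examples of this paper.
example = map_address(21, 16, 4)   # address 21 -> section 1, module 5, offset 1
```

Because the number of modules is a multiple of the number of sections, the section of an address is also the section of its module (module modulo the number of sections), so modules 0, 4, 8 and 12 all share section 0 in this configuration.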
The following definitions will help the reader to follow the method.

Definition 1: A vector stream $A = (A_0, S, VL)$ is the set of references to memory
modules $\{A_i \mid A_i = A_0 + i \times S,\ 0 \le i < VL\}$, where $A_0$ is the address of the first reference, $S$
(stride) is the distance between two consecutive references and $VL$ is the vector length,
or number of references. If the length is not relevant, a stream is specified as $A = (A_0, S)$.

Vector streams can be classified into different families according to their stride.
Definition 2: A stride family ($F_s$) is the set of vector streams with strides $S = \sigma \times 2^s$,
where $\sigma$ is an odd factor [9].

A vector stream with a stride $S = \sigma \times 2^s$ references $P_s = M / \gcd(M, 2^s)$ memory
modules periodically, and the period is $P_s$.

Definition 3: The memory module set (MMS) of the vector stream $A = (A_0, S)$ is the set
of all the memory modules accessed by the vector stream $A = (A_0, S, P_s)$:
$MMS = \{m_i \mid m_i = (A_0 + i \times S) \bmod M,\ 0 \le i < P_s\}$.

Definition 4: A stride subfamily ($SF_s^j$) is the set of vector streams of a family that
reference the same set of memory modules.

To give some examples, the family $F_0$ (odd-strided vector streams) only has one
subfamily, $SF_0^0$, and the family $F_1$ (even-strided vector streams) has two subfamilies:
$SF_1^0$ references the even memory modules, and $SF_1^1$ references the odd modules.
Definition 5: Classical access is the access order that uses the recurrence $A_{i+1} = A_i + S$
($S$ = stride) to compute vector stream addresses.
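
Definitions 2 to 4 can be exercised with a small sketch, assuming the stride is written as an odd factor times a power of two and that the period is the number of modules divided by its greatest common divisor with that power of two. The function names below are chosen for this illustration; the subfamily test simply compares families and memory module sets.

```python
from math import gcd
from typing import Set, Tuple


def stride_family(stride: int) -> Tuple[int, int]:
    """Return (sigma, s) such that stride == sigma * 2**s with sigma odd."""
    if stride <= 0:
        raise ValueError("the sketch only handles positive strides")
    s = 0
    while stride % 2 == 0:
        stride //= 2
        s += 1
    return stride, s


def period(stride: int, num_modules: int) -> int:
    """Number of distinct modules referenced, M / gcd(M, 2**s)."""
    _, s = stride_family(stride)
    return num_modules // gcd(num_modules, 2 ** s)


def memory_module_set(start: int, stride: int, num_modules: int) -> Set[int]:
    """Modules visited during one period of the classical recurrence A[i+1] = A[i] + S."""
    return {(start + i * stride) % num_modules
            for i in range(period(stride, num_modules))}


def same_subfamily(a: Tuple[int, int], b: Tuple[int, int], num_modules: int) -> bool:
    """Streams (start, stride) are in one subfamily if family and module set agree."""
    return (stride_family(a[1])[1] == stride_family(b[1])[1]
            and memory_module_set(*a, num_modules) == memory_module_set(*b, num_modules))


even_modules = memory_module_set(0, 2, 16)   # stride 2 from an even start
odd_modules = memory_module_set(1, 2, 16)    # stride 2 from an odd start
```

With 16 modules, a stride of 2 has period 8 and touches either the even or the odd modules depending on the starting address, which is the two-subfamily example given above; any odd stride has period 16 and touches every module.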

Since the vector length is usually greater than the vector register length, the compiler
is required to transform the code using strip-mining. Under this condition, a great
proportion of memory accesses from vector streams are issued by vector load and store
instructions, which are of a fixed length equal to the vector register length. To simplify
the explanation of the proposed method, let us assume that the vector
stream length $VL$ is a power of two and a multiple of the vector register length $MVL$, itself a power of two, which is
assumed to be a multiple of the number of memory modules $M = 2^m$.


3 Characterization of the Conflicts

The concurrent access can be conflict-free only if the memory request rate imposed by
the concurrent vector streams is equal to or less than the memory module response rate.
When the request rate is equal to the response rate, it is said that the
memory system (or similarly the memory modules) works tight, and when the request
rate is less than the response rate, the memory system works loose.
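
The distinction can be made concrete with a small sketch. It rests on an assumption that is not stated explicitly in the text: each stream issues one request per cycle, and each referenced module can serve one request every memory cycle, so the request rate is the number of streams and the response rate is the number of referenced modules divided by the memory cycle. The function name `classify_load` and the label "overloaded" are introduced for this illustration only.

```python
def classify_load(num_streams: int, modules_referenced: int, memory_cycle: int) -> str:
    """Compare the request rate with the response rate without using fractions.

    request rate  = num_streams                        (requests per cycle)
    response rate = modules_referenced / memory_cycle  (requests per cycle)
    """
    demand = num_streams * memory_cycle   # both rates multiplied by memory_cycle
    if demand < modules_referenced:
        return "loose"
    if demand == modules_referenced:
        return "tight"
    return "overloaded"   # conflict-free access is impossible in this case


four_streams = classify_load(4, 16, 4)    # four one-strided streams, 16 modules
three_streams = classify_load(3, 16, 4)   # three one-strided streams, 16 modules
```

Under these assumptions, four one-strided streams on 16 modules with a memory cycle of 4 make the system work tight, while three such streams make it work loose.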

To obtain a conflict-free access, it is not enough that the system works loose or tight; in
addition, the concurrent access of the vector streams must fulfil two conditions:

• consecutive references to a memory module must be separated by at least $n_c$ cycles (to
avoid memory module conflicts).

• since memory modules share sections, only a few sets of concurrent memory
module references are correct (to avoid section conflicts).

To  analyse  the  effect  of  the  first  condition,  we  first  study  a  memory  system  that  can 
only  present  conflicts  in  the  access  to  the  modules,  not  in  the  interconnection  network. 
Then,  we  extend  the  study  to  a  complex  memory  system  to  discuss  the  second  condition. 

Simple  Memory  System 

A simple memory system has an independent access path from every port to every
memory module, so its interconnection network does not present conflicts. In such a
system, the concurrent classical access of vector streams that have the same
stride has a conflict-free steady state when the request rate they imply is less than or
equal to the response rate of the memory modules (the system works loose or tight) [16][17].

Fig. 2 presents the concurrent classical access of four one-strided vector streams in a
memory system with 16 memory modules and an $n_c$ of 4 cycles. Vector streams start
their concurrent access in different memory modules. In the figure, it is possible to
observe for every cycle the memory module that begins to be occupied by every vector
stream (the module remains occupied during latency cycles). A delay due to a memory
module conflict is depicted in black.

Fig. 2. 16-way interleaved memory with $n_c$ = 4. Conflicts with the classical access.

This concurrent access presents conflicts at the very beginning, but the steady state,
which starts at cycle 8, is conflict-free. At the steady state, four sets of concurrent memory
module references ({0, 4, 8, 12}, {1, 5, 9, 13}, {2, 6, 10, 14} and {3, 7, 11, 15}) are
periodically repeated every $n_c$ cycles; thereby, consecutive references to the same
memory module are separated by $n_c$ cycles. The periodicity of these four sets (called CMR,
Concurrent memory Module References, from now on) can be guaranteed because
vector streams reference the memory modules in the same order.
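
The behaviour described above can be explored with a small cycle-level sketch of a simple memory system. It assumes a fixed priority by port index and that a stalled port retries the same reference in the next cycle; it does not reproduce the starting modules of the figure, which are not listed in the text, so the cycle at which its steady state begins need not match the figure. The function name `simulate_classical` is chosen for this illustration.

```python
from typing import List, Sequence


def simulate_classical(starts: Sequence[int], num_modules: int, memory_cycle: int,
                       cycles: int, stride: int = 1) -> List[int]:
    """Return, for every cycle, the number of ports delayed by a module conflict."""
    next_addr = list(starts)
    busy_until = [0] * num_modules      # first cycle at which each module is free again
    stalls_per_cycle = []
    for cycle in range(cycles):
        stalls = 0
        for port in range(len(next_addr)):          # lower index = higher priority (assumed)
            module = next_addr[port] % num_modules
            if busy_until[module] > cycle:          # module busy: the port waits
                stalls += 1
                continue
            busy_until[module] = cycle + memory_cycle
            next_addr[port] += stride               # classical recurrence A[i+1] = A[i] + S
        stalls_per_cycle.append(stalls)
    return stalls_per_cycle


# Streams that start n_c modules apart never meet a busy module.
spread = simulate_classical([0, 4, 8, 12], 16, 4, 64)
# Streams that all start in module 0 collide at first, then fall into step.
clustered = simulate_classical([0, 0, 0, 0], 16, 4, 64)
```

In the second run the ports are delayed only during the first 12 cycles; after that the four streams follow each other $n_c$ cycles apart and no further conflict occurs, which illustrates the conflict-free steady state of same-stride streams in a tight system.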

R. Raghavan and J.P. Hayes state in theorem 6 of [17] the conditions that
concurrent vector streams must fulfil to obtain a conflict-free classical access in a
simple memory system. These conditions can be fulfilled only by vector streams that
belong to the same subfamily. All the combinations of vector streams that have the same
stride have a conflict-free access whenever the system works loose or tight. The
concurrent classical access of vector streams of different subfamilies is always
conflictive (corollary 3 of [4]).

Complex  Memory  System 

Combinations of vector streams that obtain a conflict-free access in a simple memory system may not behave well in a complex memory system. The CMR sets that are suitable in a simple memory system may not be appropriate in a system where several memory modules are mapped in the same section. As an example, none of the CMR sets of the concurrent access of Fig. 2 is appropriate in a complex memory system where the 16 memory modules are mapped in an interleaved way in 4 sections (Fig. 1): all the memory modules of every CMR are mapped in the same section, so they cannot be accessed concurrently.
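The section clash described above can be checked with a few lines of code. The sketch below assumes that the interleaved mapping of Fig. 1 places module m in section m mod SC; that mapping and the function name are illustrative assumptions, not definitions taken from the text.

```python
SC = 4  # number of sections

def section_of(module, sections=SC):
    """Section of a module under an assumed interleaved mapping (module mod sections)."""
    return module % sections

# CMR sets of the conflict-free steady state in the simple memory system (Fig. 2).
cmr_sets = [[0, 4, 8, 12], [1, 5, 9, 13], [2, 6, 10, 14], [3, 7, 11, 15]]

for cmr in cmr_sets:
    used = sorted({section_of(m) for m in cmr})
    # Every CMR collapses onto a single section, so its modules
    # cannot be accessed in the same cycle in the complex system.
    print(cmr, "-> sections", used)
```

Under this assumed mapping every CMR set uses exactly one section, which is the situation the paragraph describes.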

Fig. 3 shows the conflictive classical access of four one-strided vector streams in the system of Fig. 1. The delay due to a section conflict is represented in light grey, and a memory module conflict is depicted in black; a section is locked during one cycle in the access to a memory module. In this concurrent access, conflicts are linked and periodically repeated: a section conflict causes a memory module conflict, which in turn causes a section conflict, and so on.

Fig. 3. 16-way interleaved memory system with n_c = 4 and SC=4. Conflicts with the classical access.

T. Cheung and J.E. Smith characterize in [2] the linked conflicts that appear in the concurrent classical access of two one-strided vector streams, and use the term complex linked conflict (complex conflict) when three or more vector streams interfere with each other in a less precise way. The authors prove that the steady-state linked conflicts and complex conflicts reduce the effective bandwidth.

The authors of [2] show that in the concurrent classical access of three one-strided vector streams (the system works loose), linked conflicts appear in 34% of the cases (combinations of initial memory modules), complex conflicts are generated in 7% of the cases, and performance can be degraded by 20%.

To solve these conflicts, W. Oed and O. Lange conclude in [16] that n_c and SC must be coprime (Theorem 9). A solution with a prime SC is proposed in [15]. In [2], the authors give some alternatives to avoid linked conflicts, e.g. a solution with odd values of n_c. For all the proposals, if the vector streams have different strides conflicts persist and, in any case, complex conflicts do not disappear.

Tab. 1 shows the asymptotic number of operations per cycle¹ (R_∞) that the classical access obtains on average for four types of combinations of vector streams, in a simple memory system (M=16 and n_c=4) and in the corresponding complex system (M=16, n_c=4 and SC=4). The concurrent accesses simulated are all the combinations of four, three and two odd-strided vector streams, two odd-strided with one even-strided vector stream, and two even-strided vector streams. For the simple memory system, the average R_∞ for the classical access is far from the ideal, even for combinations for which the system works very loose and the vector streams belong to the same subfamily. Comparing the results for both memory systems, it can easily be concluded that in a complex memory system the results are worse because of interferences in the interconnection network.

1. R_∞ is the product of ops, the processor cycle time and r_∞, where ops is the number of concurrent vector streams and r_∞ is the asymptotic performance [12].

Tab. 1. R_∞ for the classical access and Ideal.

Combinations     Complex Mem. Syst.       Simple Mem. Syst.
of Strides       M=16, n_c=4, SC=4        M=16, n_c=4
Odd    Even      Classical    Ideal       Classical    Ideal
 4      0          1.57         4           1.86         4
 3      0          1.51         3           1.66         3
 2      0          1.35         2           1.38         2
 2      1          1.39         3           1.60         3
 0      2          1.05         2           1.27         2

The  next  section  presents  an  access  method  that  completely  avoids  conflicts  in  the 
concurrent  access  of  vector  streams  of  the  same  subfamily  when  the  system  works  loose 
or  tight.  This  method  also  dramatically  reduces  conflicts  for  other  cases  of  concurrent 
access.  The  name  of  the  proposal  is  Skewed  Sequence  of  memory  Modules  (SSM). 

4  Proposal  SSM 

To reduce the number of memory module conflicts, we propose that concurrent vector streams reference the memory modules in the same order. All the vector streams of a subfamily reference the same set of P_x memory modules (P_x = M/gcd(M,2^x)) but, with the classical access, the order every vector uses to access them depends on the σ-factor of the stride. We propose to construct a σ-independent access order, so that all the vector streams of a subfamily reference the P_x modules in the same order.
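The dependence of the classical order on the σ-factor can be seen with a small sketch. It assumes the stride is written as S = σ × 2^x with σ odd and that the module of an address is the address modulo M; the function names are illustrative and strides are assumed to be positive.

```python
from math import gcd

def split_stride(stride):
    """Write a positive stride as sigma * 2**x with sigma odd; return (sigma, x)."""
    x = 0
    while stride % 2 == 0:
        stride //= 2
        x += 1
    return stride, x

def classical_order(a0, stride, m):
    """Modules visited during one period (P_x references) of the classical access."""
    _, x = split_stride(stride)
    period = m // gcd(m, 2 ** x)  # P_x = M / gcd(M, 2^x)
    return [(a0 + i * stride) % m for i in range(period)]

M = 16
order_s1 = classical_order(0, 1, M)  # sigma = 1, x = 0
order_s3 = classical_order(0, 3, M)  # sigma = 3, x = 0
print(order_s1)
print(order_s3)
```

Both streams belong to the same family and touch the same 16 modules, but in different orders, which is what a σ-independent order is meant to remove.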

To avoid section conflicts, this σ-independent access order must be constructed so that the resulting CMR sets comprise memory modules mapped in different sections.

This new sequence of memory modules will be called SSM (Skewed Sequence of memory Modules). Fig. 4 shows the SSM proposed for different subfamilies in a memory system that has M=16, n_c=4 and SC=4 (Fig. 1). For every SSM sequence, the sequence of sections referenced and the corresponding CMR are also shown.


Subfamily SF_0^0 (odd strides, all modules) - CMR = {{0,7,10,13}, {1,4,11,14}, {2,5,8,15}, {3,6,9,12}}

  SSM        0  1  2  3 |  7  4  5  6 | 10 11  8  9 | 13 14 15 12
  sections   0  1  2  3 |  3  0  1  2 |  2  3  0  1 |  1  2  3  0
             subperiod 0 | subperiod 1 | subperiod 2 | subperiod 3

Subfamily SF_1^0 (even strides, even modules) - CMR = {{0,10}, {2,8}, {4,14}, {6,12}}

  SSM        0  2  4  6 | 10  8 14 12
  sections   0  2  0  2 |  2  0  2  0
             subperiod 0 | subperiod 1

Subfamily SF_1^1 (even strides, odd modules) - CMR = {{1,11}, {5,15}, {7,13}, {3,9}}

  SSM        1  3  7  5 | 11  9 13 15
  sections   1  3  3  1 |  3  1  1  3
             subperiod 0 | subperiod 1

Fig. 4. 16-way interleaved memory with n_c = 4 and SC=4. SSM for several subfamilies.

Each one of the SSM we propose has n_c CMR sets of P_x/n_c memory modules. In consequence, concurrent vector streams of a subfamily can concurrently reference memory modules of different sections, avoiding section conflicts. Besides, module conflicts are also avoided, as consecutive references to a CMR are n_c cycles apart.

Fig. 5 shows the conflict-free access of four odd-strided vector streams in the system of Fig. 1, when the corresponding SSM is used. This SSM has n_c=4 CMR sets with P_x/n_c = 16/4 modules, so four odd-strided vector streams can have a conflict-free access.

Fig. 5. 16-way interleaved memory system with n_c = 4 and SC=4. Conflict-free concurrent access of four odd-strided vector streams using SSM.

Vector  streams  of  Fig.  5  start  their  concurrent  access  in  correct  memory  modules 
(same  CMR),  so  the  concurrent  access  synchronizes  from  the  beginning.  When  the  start 
addresses  do  not  correspond  to  a  CMR,  an  arbitration  algorithm  is  necessary.  Section  4.2 
presents  a  dynamic  arbitration  that  forces  vector  streams  to  concurrently  access  memory 
modules  of  the  appropriate  CMR  [3]. 

4.1  Skewed  Sequence  of  memory  Modules  -  SSM 

The new sequence of memory modules is called "Skewed" because the SSM we define for every subfamily is the result of applying a skew function to the lexicographically ordered MMS of the subfamily.

Definition 6: For a vector stream A = (A_0, S, P_x) of the subfamily SF_x^{M_0} (M_0 = A_0 mod gcd(M,2^x)), we call Skewed Sequence of memory Modules (SSM) the sequence determined by the expression:

$$k = f(m_i) = \frac{\left(\left(m_i + \lfloor m_i/n_c \rfloor\right) \bmod n_c\right) + \lfloor m_i/n_c \rfloor \times n_c}{\gcd(M, 2^x)},$$

where k is the position that the memory module m_i (0 ≤ m_i < M) occupies in the sequence and m_i belongs to the MMS of the vector stream.
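As an illustration of Definition 6, the sketch below evaluates the position function for every module of a subfamily and places each module at its position. It assumes that the division by gcd(M,2^x) is an integer (floor) division, an interpretation that is not explicit in the expression; the function names are illustrative.

```python
from math import gcd

def ssm_position(module, m, n_c, x):
    """Position k = f(m_i) of a module in the SSM (floor division by the gcd assumed)."""
    g = gcd(m, 2 ** x)
    group = module // n_c  # floor(m_i / n_c)
    return ((module + group) % n_c + group * n_c) // g

def ssm_from_positions(m0, m, n_c, x):
    """Build the SSM of the subfamily whose first module is m0."""
    g = gcd(m, 2 ** x)
    modules = range(m0, m, g)  # lexicographically ordered MMS of the subfamily
    sequence = [None] * len(modules)
    for mod in modules:
        sequence[ssm_position(mod, m, n_c, x)] = mod
    return sequence

print(ssm_from_positions(0, 16, 4, 0))  # odd strides, all modules
print(ssm_from_positions(0, 16, 4, 1))  # even strides, even modules
print(ssm_from_positions(1, 16, 4, 1))  # even strides, odd modules
```

With M=16 and n_c=4, this reading of the expression yields the sequences 0 1 2 3 7 4 5 6 10 11 8 9 13 14 15 12, 0 2 4 6 10 8 14 12 and 1 3 7 5 11 9 13 15.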

The function f^{-1} that gives the memory module from a position in the sequence (the inverse function of f(m_i)) makes it possible to generate the SSM sequence. We express f^{-1} as an algorithm but, before presenting it, we make some considerations (Fig. 6 helps to follow the explanation):

• The first module a vector references with the SSM is M_0 = A_0 mod gcd(M,2^x).

• Every set of ⌈n_c/gcd(M,2^x)⌉ consecutive memory modules of the lexicographically ordered MMS suffers a skew. We call each of these sets a GS, and in an SSM there are (M/n_c) × ⌈gcd(M,2^x)/n_c⌉ GS sets.




FEUP  -  Faculdade  de  Engenharia  da  Universidade  do  Porto 


• The same skew is applied to gcd(M,2^x) consecutive GS sets, but the first skew is applied to at most gcd(M,2^x) consecutive GS. If M_0 is not the memory module 0, only the first gcd(M,2^x) − M_0 GS sets suffer the same skew.

To give an example of the former considerations, in a system with M=16, n_c = 4 and SC=4, the SSM of the subfamily SF_1^1 has (M/n_c) × ⌈gcd(M,2^x)/n_c⌉ = 4 GS sets. The first skew is applied only to gcd(M,2^x) − M_0 = 1 GS, as M_0 is the memory module 1, but the second skew is applied to gcd(M,2^x) = 2 consecutive GS.

[Figure panel: Subfamily SF_0^0 (odd strides, all modules)]

Fig. 6. 16-way interleaved memory system with n_c = 4 and SC=4. SSM construction for several subfamilies.


The  algorithm  used  to  generate  the  SSM  sequence  for  any  subfamily  is: 

    M_0 = A_0 mod gcd(M, 2^x)
    control = M_0
    skew = 0
    for NGS = floor(M_0 / n_c) to M/n_c - 1 step ceil(gcd(M, 2^x) / n_c)
        for I = M_0 mod n_c to n_c - 1 step gcd(M, 2^x)
            module = ((I - skew × gcd(M, 2^x)) mod n_c + NGS × n_c) mod M
        endfor
        control = (control + ceil(gcd(M, 2^x) / n_c)) mod gcd(M, 2^x)
        if (control = 0) then skew = skew + 1
    endfor

In the algorithm, NGS controls the generation of the memory module references for every GS set. The variable I controls the generation of the memory module references within a GS set. The variable control triggers the skew changes after the generation of gcd(M,2^x) consecutive GS. If M_0 is not the memory module 0, only the first gcd(M,2^x) − M_0 GS sets suffer the same skew.
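The generation algorithm can be transcribed almost line by line. The following is a minimal sketch under stated assumptions: the loop start is taken as floor(M_0/n_c), the step of the outer loop as ceil(gcd(M,2^x)/n_c), and mod as the non-negative remainder; the function name and the returned list are illustrative additions.

```python
from math import gcd

def generate_ssm(a0, m, n_c, x):
    """Generate the SSM module sequence for a stream starting at address a0."""
    g = gcd(m, 2 ** x)
    gs_step = -(-g // n_c)  # ceil(g / n_c) using integer arithmetic
    m0 = a0 % g             # first module referenced with the SSM
    control, skew = m0, 0
    sequence = []
    for ngs in range(m0 // n_c, m // n_c, gs_step):   # one iteration per GS set
        for i in range(m0 % n_c, n_c, g):             # modules inside the GS set
            # Python's % is non-negative for a positive modulus,
            # which matches the mathematical mod used in the algorithm.
            sequence.append(((i - skew * g) % n_c + ngs * n_c) % m)
        control = (control + gs_step) % g
        if control == 0:
            skew += 1
    return sequence

print(generate_ssm(0, 16, 4, 0))
print(generate_ssm(0, 16, 4, 1))
print(generate_ssm(1, 16, 4, 1))
```

For M=16 and n_c=4, the three calls produce the sequences of an odd-stride stream, an even-stride stream on even modules and an even-stride stream on odd modules, respectively.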

4.2  Arbitration  algorithm 

An  arbitration  algorithm  is  needed  in  order  to  synchronize  vector  streams  to  reach  a 
conflict-free  steady-state  phase,  or  to  dramatically  reduce  inter-conflicts,  for  any 
combination  of  initial  memory  modules. 

The SSM sequences can be divided into P_x/n_c subperiods of memory modules. In Fig. 4, we can observe that each subperiod of an SSM references the sections following a predetermined order, which is different for every subperiod. Thus, in the concurrent access of P_x/n_c vector streams of a subfamily, we obtain a conflict-free access if we overlap different subperiods (different sections are simultaneously referenced, as in the example of Fig. 5 with family F_0). The main idea is that, in every cycle, concurrent vector streams reference memory modules of different subperiods, and these different subperiods must be aligned.

The arbitration algorithm controls the subperiod changes between vector streams; when all subperiod changes have been detected for all the vector streams, subperiods are assigned using a fixed priority. The subperiod change is detected by computing the expression subperiod = ⌊m_i/(n_c × gcd(M,2^x))⌋ mod SC (where m_i is the address modulo M) for two consecutive memory module requests of a vector stream (the current and the previous).
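The detection of a subperiod change can be sketched as follows. The code assumes the expression reads subperiod = floor(m_i / (n_c × gcd(M,2^x))) mod SC with m_i the address modulo M; the function names are hypothetical, and the priority-based assignment of subperiods is not modelled.

```python
from math import gcd

def subperiod(address, m, n_c, x, sc):
    """Subperiod of the module referenced by an address (assumed reading of the expression)."""
    module = address % m
    return (module // (n_c * gcd(m, 2 ** x))) % sc

def subperiod_changes(addresses, m, n_c, x, sc):
    """Indices where a request falls in a different subperiod than the previous request."""
    ids = [subperiod(a, m, n_c, x, sc) for a in addresses]
    return [i for i in range(1, len(ids)) if ids[i] != ids[i - 1]]

# Module sequences used as addresses, for M=16, n_c=4, SC=4.
odd_ssm = [0, 1, 2, 3, 7, 4, 5, 6, 10, 11, 8, 9, 13, 14, 15, 12]
even_ssm = [0, 2, 4, 6, 10, 8, 14, 12]
print(subperiod_changes(odd_ssm, 16, 4, 0, 4))
print(subperiod_changes(even_ssm, 16, 4, 1, 4))
```

Comparing the value for the current and the previous request is enough to flag a change, so only one previous value per stream needs to be kept.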

4.3  Skewed Sequence of memory References - SSR

The SSM is the order in which memory modules must be referenced periodically, so the memory references of a vector stream must be generated to access the modules periodically in this new order.

Definition 7: For a vector stream A = (A_0, S) of the subfamily SF_x^{M_0} (M_0 = A_0 mod gcd(M,2^x)), the Skewed Sequence of References (SSR) is the sequence of memory references that references the memory modules following the SSM periodically.

The algorithm that generates the SSR is a modification of the algorithm that generates the SSM. The following definition will help in designing the algorithm.

Definition 8: The order number (ON) of a vector stream element is the position in which its address is generated using the classical access, 0 ≤ ON < VL.

The address of an element of a vector stream A = (A_0, S) can be computed using its order number as Addr = A_0 + ON × S. With the classical access, addresses of elements with consecutive ON are consecutively generated (ON_{i+1} = ON_i + 1). This is not the case with the SSR but, if we know how to generate the sequence of order numbers that fulfils the SSM, we will be able to generate the SSR.

First, we suppose the SSR is P_x references long; then we extend the study to any length.

P_x references long (one Period)

Vector elements placed in memory modules that are adjacent in the lexicographically ordered MMS have order numbers separated by a constant distance, C_x [3]. Then, we can compute the ON of a vector element from the ON of any other vector element if we know the distance¹ between the memory modules where they are placed: ON_j = ON_i + K × C_x, where K is the distance.

To compute the sequence of order numbers the SSM implies, we can use as K the distance between the memory module to be referenced and the first memory module referenced using the SSM, which is M_0 = A_0 mod gcd(M,2^x). Then, we must use the order number of the first vector stream element referenced using the SSR, ON_0, which can be easily computed. In this case, the order number of a vector element placed in the memory module m_j is:

$$ON_j = ON_0 + \frac{(m_j - M_0) \bmod M}{\gcd(M, 2^x)} \times C_x.$$
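The order-number expression can be exercised with a short sketch. The text takes the constant C_x from [3] without giving its value; the code below assumes C_x is the modular inverse of σ modulo P_x (an assumption made here so that moving K modules forward in the MMS adds K × C_x to the order number), and reduces the result modulo P_x so that it stays inside one period. All names are illustrative.

```python
from math import gcd

def ssr_order_numbers(a0, sigma, x, ssm, m):
    """Order numbers (within one period) that visit the modules in SSM order."""
    g = gcd(m, 2 ** x)
    p_x = m // g
    c_x = pow(sigma, -1, p_x)  # assumed C_x; needs Python 3.8+ and odd sigma
    m0 = a0 % g                # first module referenced with the SSM
    # Order number of the element stored in M_0, measured from A_0.
    on0 = (((m0 - a0) % m) // g) * c_x % p_x
    return [(on0 + (((mod - m0) % m) // g) * c_x) % p_x for mod in ssm]

M = 16
a0, sigma, x = 5, 3, 0  # stride S = sigma * 2**x = 3
ssm_odd = [0, 1, 2, 3, 7, 4, 5, 6, 10, 11, 8, 9, 13, 14, 15, 12]
ons = ssr_order_numbers(a0, sigma, x, ssm_odd, M)
modules = [(a0 + on * sigma * 2 ** x) % M for on in ons]
print(ons)
print(modules)
```

Mapping the generated order numbers back to addresses and then to modules reproduces the SSM, and every order number of the period is used exactly once.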

Any  Length  (any  number  of  Periods) 

As the distance between memory modules can be computed within a period, the former recurrence actually gives the order number relative to a period (ONR). To extend the computation of the order number to any number of periods, we can consider that every period has a base order number (BN), to be added to the ONR to obtain the ON. From period to period this BN must be increased by P_x units.

1. The distance is the number of memory modules between them in the lexicographically ordered MMS.

The next algorithm is based on the algorithm proposed in Section 4.1, adding the computation of the order number and the loop that controls the period. The added lines are marked with a leading +.

      M_0 = A_0 mod gcd(M, 2^x)
    + BN = 0
    + for Q = 0 to ceil(VL / P_x) - 1
          control = M_0
          skew = 0
          for NGS = floor(M_0 / n_c) to M/n_c - 1 step ceil(gcd(M, 2^x) / n_c)
              for I = M_0 mod n_c to n_c - 1 step gcd(M, 2^x)
                  module = ((I - skew × gcd(M, 2^x)) mod n_c + NGS × n_c) mod M
    +             K = ((module - M_0) mod M) / gcd(M, 2^x)
    +             ONR = (ON_0 + K × C_x) mod P_x
    +             Addr^SSR = A_0 + (BN + ONR) × S
              endfor
              control = (control + ceil(gcd(M, 2^x) / n_c)) mod gcd(M, 2^x)
              if (control = 0) then skew = skew + 1
          endfor
    +     BN = BN + P_x
    + endfor

As a synopsis, the recurrences that compute the vector memory references are:

$$A_j^{abs} = A_0 + Base\_Addr + A_j^{r} \quad \text{and} \quad A_j^{r} = (A_i^{r} + K \times C_x \times S) \bmod (P_x \times S),$$

where A_j^r is the vector element address relative to a period, A_j^abs is its absolute address, K is the distance between the memory modules where A_i^r and A_j^r are placed, and Base_Addr is the base address of a period (BN × S).

4.4  Hardware  Support  to  Reduce  Conflicts 

To design the hardware that computes the SSR, we must rewrite the algorithm to make it easier to implement. Two issues must be solved: the presence of a multiplier and of a modulo operation in the critical path of the address computation (every iteration).

To avoid the use of a multiplier, the relative addresses A_j^r are computed from previously computed relative addresses A_i^r, so only two precomputed products K × C_x × S mod (P_x × S) must be used (K=1 and K=n_c). This implies using three registers to store different previous values A_i^r.

The modulo operation (mod (P_x × S)) can be performed by subtracting P_x × S if necessary, as demonstrated in [5]. In fact, the two values, A_i^r + K × C_x × S and A_i^r + K × C_x × S − P_x × S, are computed in parallel, and the selection between them is performed by a signal that comes from the vector register index computation [5]. This signal indicates whether ON_i + K × C_x ≥ P_x, which is easier to compute as P_x is a power of two.
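The replacement of the modulo by a conditional subtraction can be illustrated in software. This is a simplified sketch, not the data-path of the paper: it selects on the address sum itself rather than on the order-number signal, and it assumes both operands are already reduced (0 ≤ value < modulus). The function name and the numeric values are hypothetical.

```python
def next_relative_address(a_rel, increment, modulus):
    """Compute (a_rel + increment) mod modulus without a modulo operator.

    Assumes 0 <= a_rel < modulus and 0 <= increment < modulus, so a single
    subtraction is always enough.
    """
    plain = a_rel + increment
    wrapped = plain - modulus  # in hardware both candidates are computed in parallel
    return wrapped if plain >= modulus else plain

# Hypothetical values: P_x = 16, S = 3, precomputed product (K * C_x * S) mod (P_x * S) = 33.
modulus, increment = 16 * 3, 33
a_rel = 0
trace = []
for _ in range(6):
    a_rel = next_relative_address(a_rel, increment, modulus)
    trace.append(a_rel)
print(trace)
```

Because the increment is precomputed and already reduced, each step needs only two additions and a selection, with no multiplier or divider.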

Fig. 7 shows a hardware design of the data-path. The hardware cost is moderate (two adders in the critical path and a CSA), and it is not more complex than that needed by other solutions [8][18] proposed to reduce the average memory latency time in vector processors.

The rate at which memory requests can be issued is limited by the rate at which additions can be performed. The design can be pipelined to obtain a reduction of the cycle time (this would also be needed in the classical access). The additional hardware introduces an initial delay of a few cycles in the memory path. The number of clock cycles needed to access the memory is of the order of 14 + MVL for the CRAY X-MP, 17 + MVL for the CRAY Y-MP and 23 + MVL for the C90 [20]. However, as the processor speed continues to increase faster than the memory speed, an extra initial delay of some cycles introduced by the proposed hardware is acceptable.

The number of parameters to be calculated is comparable to the number needed for other proposals [8][18][22], and most of them can be determined by the compiler.

The hardware needed to access the vector registers is similar to the hardware shown in Fig. 7, but simpler.

The  cost  of  the  hardware  components  can  be  considered  a  minor  part  of  the  cost  of 
the  memory  subsystem.  Additionally,  in  contrast  with  other  solutions,  which  include  a 
significant  number  of  buffers  to  eliminate  the  effect  of  unsuitable  temporal  distributions 
[8][18],  this  proposal  does  not  need  buffers. 

5  New  method  performance 

In this section we present the advantages of the method proposed in this paper. Tab. 2 shows the comparison between the SSM and the classical access in a memory system with M=16 memory modules, mapped in an interleaved way in SC=4 sections, with n_c=4. Some considerations about the simulations:

a) We obtain the value R_∞ for the concurrent access, using the classical access and the proposal, of all the possible combinations of four, three or two vector streams of the family F_0 and the subfamily SF_1.

b) All the combinations of vector streams whose concurrent access has been simulated have a non-void intersection of MMS sets.

c) The parameter we use to perform the comparison is the increment in performance (IR_∞) implied by the proposal, and it is computed as IR_∞ = ((R_∞^SSM − R_∞^classical) / R_∞^classical) × 100.

d) The results presented under the name R_∞ are harmonic means of the asymptotic number of operations per cycle that the classical access and the SSM obtain for combinations of vector streams that we group in types.

Tab. 2 presents the R_∞ that the classical access and the SSM obtain for several types of vector stream combinations in a 16-way interleaved memory system with n_c = 4 and SC=4. The table also shows the maximum number of operations per cycle (R_∞ Ideal) that could ideally be obtained for every combination in the supposed memory system. The increment in performance the SSM implies is presented in the column labelled IR_∞.

Tab. 2. 16-way interleaved memory system with n_c = 4 and SC=4. R_∞ and IR_∞ for SSM.

Streams    Streams     R_∞      R_∞          R_∞          IR_∞
of F_0     of SF_1     Ideal    Classical    SSM (SSR)
4          [?]         4        1.57         3.95         152%
3          [?]         3        1.51         2.98         97%
2          [?]         2        1.35         1.99         [?]
2          1           3        1.39         1.99         43%
1          1           2        1.18         1.33         13%
1          2           2.4      1.19         1.99         67%
[?]        [?]         2.67     1.34         2.65         98%
[?]        [?]         2        1.05         [?]          90%
[?]        [?]         2        1.03         [?]          46%
[?]        4           2        1.04         1.99         91%

In the table, the types with an asterisk ('*') correspond to combinations of vector streams of the same subfamily that make the system work loose or tight. For these types the R_∞ the SSM obtains is almost R_∞ Ideal, and the IR_∞ is very important, between 47% and 152%. For the other types, the IR_∞ is also important, between 13% and 98%.
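The two quantities used in the comparison can be computed as below. The sketch assumes the increment in performance is the relative gain of the SSM over the classical access expressed as a percentage, and that the per-type value is a plain harmonic mean; the function name is illustrative and the list passed to the harmonic mean holds hypothetical values.

```python
from statistics import harmonic_mean

def performance_increment(r_classical, r_ssm):
    """Relative gain of the SSM over the classical access, in percent (assumed definition)."""
    return (r_ssm - r_classical) / r_classical * 100.0

# R_inf pairs (classical, SSM) as read from Tab. 2.
print(round(performance_increment(1.39, 1.99)))  # two odd-strided plus one even-strided stream
print(round(performance_increment(1.19, 1.99)))

# Harmonic mean of the R_inf of the combinations grouped in one type
# (hypothetical per-combination values).
print(harmonic_mean([2.0, 4.0]))
```

The harmonic mean weights slow combinations more heavily than an arithmetic mean would, which suits a rate such as operations per cycle.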

Fig. 8.a presents the IR_∞ that the use of the SSM implies as a function of the σ-factor of the stride, in the concurrent access of: four vector streams of the family F_0 (dark bars), four vector streams of the subfamily SF_1 (medium grey bars), and two vector streams of F_0 with two vector streams of SF_1 (light bars). For every case we grouped combinations that have four (bars labelled "four"), three, two or zero (bars labelled "zero") vector streams with the same σ-factor.


[Fig. 8, panels (a) and (b): bar charts; horizontal axis: number of vector streams with the same σ.]

Fig. 8. 16-way interleaved memory system with n_c = 4 and SC=4. IR_∞ for SSM, in the concurrent access of four (a) or three vector streams (b).

When four F_0 vector streams (odd strides, dark bars) access the memory system, the memory works tight, but the concurrent access with the SSM is conflict-free and IR_∞ is substantial, between 85% and 159%. Even when all the concurrent vector streams have the same σ-factor (same stride), the SSM outperforms the classical access, as the latter does not avoid section conflicts.

For combinations of four SF_1 vector streams (even strides, medium grey bars) the concurrent access with the SSM is not conflict-free, as there are more than P_x/n_c (=8/4=2) concurrent vector streams, but the IR_∞ is important, between 69% and 105%.

When the concurrent access involves two F_0 vector streams and two SF_1 vector streams, the access with the SSM is not conflict-free, as there are vector streams of different subfamilies, but the IR_∞ is important, ranging from 69% to 104%.

Fig. 8.b presents the IR_∞ the SSM represents as a function of the σ-factor, in the concurrent access of: three vector streams of the family F_0 (dark bars), three vector streams of the subfamily SF_1 (medium grey bars), and two vector streams of F_0 with one vector stream of SF_1 (light bars). For every case we grouped combinations that have three (bars labelled "three"), two or zero (bars labelled "zero") vector streams with the same σ-factor in the stride. For these cases, the IR_∞ the SSM obtains is lower than in the case of four vector streams, as the classical access finds the system working looser and, in consequence, there are fewer conflicts or they have less effect.

Vectors and matrices are the most common data structures in vector processors. In Fortran, the most frequent accesses to matrices are made by columns, rows and diagonals, which correspond to the strides 1, n and n+1 respectively, where n is the column length, which depends on the problem size and varies widely. Present compilation technology detects whether n is even; in that case the matrix size can be increased by one row (odd stride), and the number of referenced memory modules is M. Thus, in row-major and column-major accesses the use of the SSM performs equally well, and there are no conflicts. When n is even and there is no possibility of increasing the number of rows, the SSM reduces the number of conflicts.

6  Conclusions 

The interferences between concurrent vector streams accessing the memory system of a vector or multivector processor cause conflicts in the memory that reduce the processor efficiency.

The present paper proposed a σ-independent access order to the vector stream elements (SSM), for which all the vector streams of a subfamily reference the memory modules in the same order. The use of the SSM, associated with the proposed arbitration algorithm, avoids conflicts when the concurrent access corresponds to vector streams of the same subfamily and the system works loose or tight. The proposal significantly reduces conflicts for other types of concurrent accesses.

The hardware solution that generates the SSM and the hardware used to access the vector registers have a moderate cost.

The simulations confirmed that the proposal can achieve the maximum number of operations per cycle, and the results showed that the SSM always outperforms the classical access, with performance increments between 13% and 152% for combinations of even and odd strided vector streams. In the interesting case of the
concurrent access of 4 one-strided vector streams the increment in performance is 85%.

Abstract. Multilevel algorithms are a successful class of optimisation techniques which address the mesh partitioning problem. They usually combine a graph contraction algorithm together with a local optimisation method which refines the partition at each graph level. To date these algorithms have been used almost exclusively to minimise the cut-edge weight; however, it has been shown that for certain classes of solution algorithm, the convergence of the solver is strongly influenced by the subdomain aspect ratio. In this paper therefore, we modify the multilevel algorithms in order to optimise a cost function based on aspect ratio. Several variants of the algorithms are tested and shown to provide excellent results.


1  Introduction 

The  need  for  mesh  partitioning  arises  naturally  in  many  finite  element  (FE)  and  finite 
volume (FV) applications. Meshes composed of elements such as triangles or tetrahedra are often better suited than regularly structured grids for representing completely general geometries and resolving wide variations in behaviour via variable mesh densities. Meanwhile, the modelling of complex behaviour patterns means that the problems
are  often  too  large  to  fit  onto  serial  computers,  either  because  of  memory  limitations  or 
computational  demands,  or  both.  Distributing  the  mesh  across  a  parallel  computer  so  that 
the  computational  load  is  evenly  balanced  and  the  data  locality  maximised  is  known  as 
mesh  partitioning.  It  is  well  known  that  this  problem  is  NP-complete,  so  in  recent  years 
much  attention  has  been  focused  on  developing  suitable  heuristics,  and  some  powerful 
methods,  many  based  on  a  graph  corresponding  to  the  communication  requirements  of 
the  mesh,  have  been  devised,  e.g.  [12]. 

A  particularly  popular  and  successful  class  of  algorithms  which  address  this  mesh 
partitioning  problem  are  known  as  multilevel  algorithms.  They  usually  combine  a  graph 
contraction  algorithm  which  creates  a  series  of  progressively  smaller  and  coarser  graphs 
together  with  a  local  optimisation  method  which,  starting  with  the  coarsest  graph,  refines 
the  partition  at  each  graph  level.  These  algorithms  have  been  used  almost  exclusively 
to  minimise  the  cut-edge  weight,  a  cost  which  approximates  the  total  communications 
volume in the underlying solver. Minimising the communications overhead is an important goal in any parallel application. However, it has been shown [18] that for certain classes of solution algorithm, the convergence of the solver is actually heavily influenced by the shape or aspect ratio (AR) of the subdomains. In this paper therefore,
we  modify  the  multilevel  algorithms  (the  matching  and  local  optimisation)  in  order  to 
optimise  a  cost  function  based  on  AR.  We  also  abstract  the  process  of  modification  in 
order  to  suggest  how  the  multilevel  strategy  can  be  modified  into  a  generic  technique 
which can optimise arbitrary cost functions.

1.1  Domain decomposition preconditioners and aspect ratio

To motivate the need for aspect ratio we consider the requirements of a class of solution techniques. A natural parallel solution strategy for the underlying problem is to use an iterative solver such as the conjugate gradient (CG) algorithm together with domain decomposition (DD) preconditioning, e.g. [2]. DD methods take advantage of the partition of the mesh into subdomains by imposing artificial boundary conditions on the subdomain boundaries and solving the original problem on these subdomains [4]. The subdomain solutions are independent of each other, and thus can be determined in parallel without any communication between processors. In a second step, an 'interface' problem is solved on the inner boundaries which depends on the jump of the subdomain solutions over the boundaries. This interface problem gives new conditions on the inner boundaries for the next step of subdomain solution. Adding the results of the third step to the first gives the new conjugate search direction in the CG algorithm.

The time needed by such a preconditioned CG solver is determined by two factors:
the maximum time needed by any of the subdomain solutions and the number of iterations
of the global CG. Both are at least partially determined by the shape of the subdomains.
Whilst an algorithm such as the multigrid method as the solver on the subdomains
is relatively robust against shape, the number of global iterations is heavily influenced
by the AR of subdomains [17]. Essentially, the subdomains can be viewed as elements
of the interface problem [7, 8], and just as with the normal finite element method, where
the condition of the matrix system is determined by the AR of elements, the condition
of the preconditioning matrix is here dependent on the AR of subdomains.


1.2  Overview 

Below, in Section 2, we introduce the mesh partitioning problem and establish some
terminology. We then discuss the mesh partitioning problem as applied to AR optimisation
and describe how the graph needs to be modified to carry this out. Next, in Section 3,
we describe the multilevel paradigm and present and compare three possible matching
algorithms which take account of AR. In Section 4 we then describe a Kernighan-Lin
(KL) type iterative local optimisation algorithm and describe two possible modifications
which aim to optimise AR. Finally in Section 5 we compare the results with a cut edge
partitioner, suggest how the multilevel strategy can be modified into a generic technique
and present some ideas for further investigation.

The principal innovations described in this paper are:

- In §2.2 we describe how the graph can be modified to take AR into account.

- In §3.2 we describe three matching algorithms based on AR.

- In §4.3 we describe two ways of using the cost function to optimise for AR.

- In §4.4 we describe how the bucket sort can be modified to take into account
non-integer gains.

2  The  mesh  partitioning  problem 

To define the mesh partitioning problem, let $G = G(V, E)$ be an undirected graph of
vertices $V$, with edges $E$ which represent the data dependencies in the mesh. We assume
that both vertices and edges can be weighted (with positive integer values) and that $|v|$
denotes the weight of a vertex $v$ and similarly for edges and sets of vertices and edges.
Given that the mesh needs to be distributed to $P$ processors, define a partition $\pi$ to be a
mapping of $V$ into $P$ disjoint subdomains $S_p$ such that $\bigcup_p S_p = V$. To evenly balance
the load, the optimal subdomain weight is given by $\bar{S} := \lceil |V|/P \rceil$ (where the ceiling
function $\lceil x \rceil$ returns the smallest integer $\ge x$) and the imbalance is then defined as the
maximum subdomain weight divided by the optimal (since the computational speed of
the underlying application is determined by the most heavily weighted processor).
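
The optimal subdomain weight and the imbalance defined above can be computed directly. The following is a minimal Python sketch; the function name `imbalance` and the list-based representation of the partition are illustrative assumptions and are not taken from the implementation used in this paper.

```python
def imbalance(vertex_weights, assignment, num_parts):
    """Return (optimal subdomain weight, imbalance) for a given partition.

    vertex_weights: positive integer weight |v| of each vertex
    assignment:     subdomain index (0 .. num_parts-1) of each vertex
    """
    total = sum(vertex_weights)
    optimal = -(-total // num_parts)          # integer ceiling of |V| / P
    loads = [0] * num_parts
    for weight, part in zip(vertex_weights, assignment):
        loads[part] += weight
    # Imbalance: heaviest subdomain relative to the optimal weight.
    return optimal, max(loads) / optimal
```

For ten unit-weight vertices and P = 3 the optimal weight is 4, so a 4/3/3 split has imbalance 1.0 while a 5/3/2 split has imbalance 1.25.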

The definition of the mesh-partitioning problem is to find a partition which evenly
balances the load or vertex weight in each subdomain whilst minimising some cost function
$\Gamma$. Typically this cost function is simply the total weight of cut edges, but in this
paper we describe a cost function based on AR. A more precise definition of the
mesh-partitioning problem is therefore to find $\pi$ such that $S_p \le \bar{S}$ and such that $\Gamma$ is
minimised.


2.1  The  aspect  ratio  and  cost  function 

We seek to modify the methods by optimising the partition on the basis of AR rather than
cut-edge weight. In order to do this it is necessary to define a cost function which we seek
to minimise and a logical choice would be $\max_p \mathrm{AR}(S_p)$, where $\mathrm{AR}(S_p)$ is the AR of
the subdomain $S_p$. However, maximum functions are notoriously difficult to optimise
(indeed it is for this reason that most mesh partitioning algorithms attempt to minimise
the total cut-edge weight rather than the maximum between any two subdomains) and
so instead we choose to minimise the average AR

$$\Gamma_{AR} = \frac{1}{P} \sum_p \mathrm{AR}(S_p). \qquad (1)$$

There are several definitions of AR, however. For example, for a given polygon $S$,
a typical definition [15] is the ratio of the largest circle which can be contained
entirely within $S$ (inscribed circle) to the smallest circle which entirely contains $S$
(circumcircle). However, these circles are not easy to calculate for arbitrary polygons and
in an optimisation code where ARs may need to be calculated very frequently, we do
not believe this to be a practical metric. It may also fail to express certain irregularities
of shape. A careful discussion of the relative merits of different ways of measuring AR
may be found in [16] and for the purposes of this paper we follow the ideas therein and
define the AR of a given shape by measuring the ratio of its perimeter length (surface
area in 3d) over that of some ideal shape with identical area (volume in 3d).

Suppose then that in 2d the ideal shape is chosen to be a square. Given a polygon $S$
with area $\Omega S$ and perimeter length $\partial S$, the ideal perimeter length (the perimeter length
of a square with area $\Omega S$) is $4\sqrt{\Omega S}$ and so the AR is defined as $\partial S / 4\sqrt{\Omega S}$.
Alternatively, if the ideal shape is chosen to be a circle then the same argument gives the AR of
$\partial S / 2\sqrt{\pi \Omega S}$. In fact, given the definition of the cost function (1) it can be seen that these
two definitions will produce the same optimisation problem (and hence the same results)
with the cost just modified by a constant $C$ (where $C = 1/4$ for the square and $1/2\sqrt{\pi}$
for the circle). These definitions of AR are easily extendible to 3d and given a polyhedron
$S$ with volume $\Omega S$ and surface area $\partial S$, the AR can be calculated as $C\,\partial S / (\Omega S)^{2/3}$,
where $C = 1/6$ if the cube is chosen as the optimal shape and $C = 1/((4\pi)^{1/3} 3^{2/3})$ for
the sphere. Note that henceforth, in order to talk in general terms for both 2d & 3d, given
an object $S$ we shall use the terms $\partial S$ or surface for the surface area (3d) or perimeter
length (2d) of the object and $\Omega S$ or volume for the volume (3d) or area (2d).

Of the above definitions of AR we choose to use the square/cube based formulae for
two reasons: firstly because we are attempting to partition a mesh into interlocking
subdomains (and circles/spheres are not known for their interlocking qualities) and secondly
because it gives a convenient formula for the cost function of:

$$\Gamma_{template} = \Gamma_t = \frac{1}{C} \sum_p \frac{\partial S_p}{(\Omega S_p)^{(d-1)/d}} \qquad (2)$$

where $C = 2dP$ and $d$ (= 2 or 3) is the dimension of the mesh. We refer to this cost
function as $\Gamma_{template}$ or $\Gamma_t$ because of the way it tries to match shapes to chosen templates.
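
As a concrete check of the template-based definitions of AR, the sketch below evaluates the ratio for a few simple shapes. It is only an illustration: the function name `aspect_ratio` and the `template` argument are hypothetical, and the constants are those derived from the perimeter (surface) of a square/cube or circle/sphere of equal area (volume).

```python
from math import pi, sqrt

def aspect_ratio(surface, volume, dim, template="square"):
    """AR = C * surface / volume**((dim-1)/dim) for dim = 2 or 3.

    template="square" uses the square (2d) or cube (3d) as the ideal shape;
    template="circle" uses the circle (2d) or sphere (3d).
    """
    if dim not in (2, 3):
        raise ValueError("dim must be 2 or 3")
    if template == "square":
        c = 1.0 / (2 * dim)                     # 1/4 in 2d, 1/6 in 3d
    elif template == "circle":
        if dim == 2:
            c = 1.0 / (2.0 * sqrt(pi))
        else:
            c = 1.0 / ((4.0 * pi) ** (1 / 3) * 3.0 ** (2 / 3))
    else:
        raise ValueError("unknown template")
    return c * surface / volume ** ((dim - 1) / dim)
```

A unit square and a cube of side 2 both give an AR of 1 with the square/cube template, whereas a 4 x 1 rectangle (perimeter 10, area 4) gives 1.25. The two templates differ only by a constant factor, which is why they lead to the same optimisation problem.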

In fact, it will turn out (see for example §3.2) that even this function may be too
complex for certain optimisation needs and we can define a simpler one by assuming
that all subdomains have approximately the same volume, $\Omega S_p \approx \Omega M / P$, where $\Omega M$
is the total volume of the mesh. This assumption may not necessarily be true, but it is
likely to be true locally (see §4.5). We can then approximate (2) by

$$\Gamma_t \approx \frac{1}{C} \sum_p \frac{\partial S_p}{(\Omega M / P)^{(d-1)/d}} = \frac{1}{C'} \sum_p \partial S_p \qquad (3)$$

where $C' = 2dP^{1/d} (\Omega M)^{(d-1)/d}$. This can be simplified still further by noting that the
surface of each subdomain $S_p$ consists of two components, the exterior surface, $\partial^e S_p$,
where the surface of the subdomain coincides with the surface of the mesh $\partial M$, and the
interior surface, $\partial^i S_p$, where $S_p$ is adjacent to other subdomains and the surface cuts
through the mesh. Thus we can break the $\sum_p \partial S_p$ term in (3) into two parts $\sum_p \partial^i S_p$
and $\sum_p \partial^e S_p$ and simplify (3) further by noting that $\sum_p \partial^e S_p$ is just $\partial M$, the exterior
surface of the mesh $M$. This then gives us a second cost function to optimise:

$$\Gamma_{surface} = \Gamma_s = \frac{1}{K_1} \sum_p \partial^i S_p + K_2 \qquad (4)$$

where $K_1 = 2dP^{1/d} (\Omega M)^{(d-1)/d}$ and $K_2 = \partial M / K_1$. We refer to this cost function as
$\Gamma_{surface}$ or $\Gamma_s$ because it is just concerned with optimising surfaces.
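
To make the relationship between the template cost (2) and the surface cost (4) concrete, the following Python sketch evaluates both for a partition described only by per-subdomain surfaces and volumes. The function names and argument layout are assumptions made for illustration; they do not describe the code used for the experiments.

```python
def template_cost(surfaces, volumes, dim):
    """Template cost (2): average square/cube-based AR over the subdomains."""
    num_parts = len(surfaces)
    c = 2 * dim * num_parts
    exponent = (dim - 1) / dim
    return sum(s / v ** exponent for s, v in zip(surfaces, volumes)) / c

def surface_cost(interior_surfaces, mesh_surface, mesh_volume, dim):
    """Surface cost (4): assumes every subdomain has volume mesh_volume / P."""
    num_parts = len(interior_surfaces)
    k1 = 2 * dim * num_parts ** (1 / dim) * mesh_volume ** ((dim - 1) / dim)
    k2 = mesh_surface / k1
    return sum(interior_surfaces) / k1 + k2
```

For a 2 x 2 square split into four unit squares, each subdomain has surface 4, volume 1 and interior surface 2, and both functions return 1. The two costs agree here because the equal-volume assumption behind (3) holds exactly.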


2.2  Modifying  the  graph 


Fig. 1. Left to right: a simple mesh (a), its dual (b), the same mesh with combined elements (c)
and its dual (d)

To use these cost functions in a graph-partitioning context, we must add some additional
qualities to the graph. Figure 1 shows a very simple mesh (1a) and its dual graph (1b).
Each element of the mesh corresponds to a vertex in the graph. The vertices of the graph
can be weighted as is usual (to carry out load-balancing) but in addition, vertices store
the volume and total surface of their corresponding element (e.g. $\Omega v_1 = \Omega e_1$ and $\partial v_1 = \partial e_1$).
We also weight the edges of the graph with the size of the surface they correspond
to. Thus, in Figure 1, if $D(b, c)$ refers to the distance between points $b$ and $c$, then the
weight of edge $(v_1, v_2)$ is set to $D(b, c)$. In this way, for vertices $v_i$ corresponding to
elements which have no exterior surface, the sum of their edge weights is equivalent
to their surface ($\partial v_i = \sum_E |(v_i, v_j)|$). Thus for vertex $v_2$,
$\partial v_2 = \partial e_2 = D(b, c) + D(c, e) + D(e, b) = |(v_2, v_1)| + |(v_2, v_3)| + |(v_2, v_5)|$.

When it comes to combining elements together, either into subdomains, or for the
multilevel matching (§3), these properties, volume and surface, can be easily combined.
Thus in Figure 1c where $E_1 = e_1 + e_4$, $E_2 = e_2 + e_5$ and $E_3 = e_3$ we see that volumes
can be directly summed, for example $\Omega V_1 = \Omega E_1 = \Omega e_1 + \Omega e_4 = \Omega v_1 + \Omega v_4$, as can
edge weights, e.g. $|(V_1, V_2)| = D(b, c) + D(c, d) = |(v_1, v_2)| + |(v_4, v_5)|$. The surface
of a combined object $S$ is the sum of the surfaces of its constituent parts less twice the
interior surface, e.g. $\partial V_1 = \partial E_1 = \partial e_1 + \partial e_4 - 2 \times D(a, c) = \partial v_1 + \partial v_4 - 2|(v_1, v_4)|$.
These properties are very similar to properties in conventional graph algorithms, where
the volume combines in the same way as weight and surfaces combine as the sum of edge
weights (although including an additional term which expresses the exterior surface $\partial^e$).
The edge weights function identically.

Note that with these modifications to the graph, it can be seen that if we optimise
using the $\Gamma_s$ cost function (4), the AR mesh partitioning problem is identical to the
cut-edge weight mesh partitioning problem with a special edge weighting. However, the
inclusion of non-integer edge weights does have an effect on some of the techniques
that can be used (e.g. see §4.4).
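
The combination rules of this section (volumes add, edge weights add, and the surface of a combined object is the sum of the constituent surfaces less twice the shared surface) can be sketched as follows. The dictionary-based graph representation and the function name `merge` are assumptions chosen for brevity, not the data structures of the actual implementation.

```python
def merge(elements, edges, a, b, new):
    """Combine graph vertices a and b into a single vertex called `new`.

    elements: {vertex: (volume, surface)}
    edges:    {frozenset((u, v)): weight}, weight = size of the shared surface
    Returns new (elements, edges) dictionaries; the inputs are not modified.
    """
    vol_a, surf_a = elements[a]
    vol_b, surf_b = elements[b]
    shared = edges.get(frozenset((a, b)), 0.0)
    merged = {k: v for k, v in elements.items() if k not in (a, b)}
    # Volumes sum directly; the surface loses the shared face from both sides.
    merged[new] = (vol_a + vol_b, surf_a + surf_b - 2.0 * shared)
    new_edges = {}
    for pair, weight in edges.items():
        if pair == frozenset((a, b)):
            continue                            # the collapsed edge disappears
        renamed = frozenset(new if x in (a, b) else x for x in pair)
        # Parallel edges to a common neighbour are summed.
        new_edges[renamed] = new_edges.get(renamed, 0.0) + weight
    return merged, new_edges
```

For two unit squares `a` and `b` sharing a side, lying on top of a 2 x 1 rectangle `c`, merging `a` and `b` gives a vertex of volume 2 and surface 4 + 4 - 2 = 6 (a 2 x 1 rectangle), joined to `c` by a single edge of weight 2.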


2.3  Testing  the  algorithms 
Table 1. Test meshes

mesh      no. vertices  no. edges  type           aspect ratio  mesh grading
uk                4824       6837  2d triangles           3.39      7.98e+02
t60k             60005      89440  2d triangles           1.60      2.00e+00
dime20          224843     336024  2d triangles           1.87      3.70e+03
cs4              22499      43858  3d tetrahedra          1.07      9.64e+01
mesh100         103081     200976  3d tetrahedra          1.63      2.45e+02
cyl3            232362     457853  3d tetrahedra          1.28      8.42e+00

Throughout this paper we compare the effectiveness of different approaches using a
set of test meshes. The algorithms have been implemented within the framework of JOSTLE,
a mesh partitioning software tool developed at the University of Greenwich and
freely available for academic and research purposes under a licensing agreement
(available from http://www.gre.ac.uk/~c.walshaw/jostle). The experiments
were carried out on a DEC Alpha with a 466 MHz CPU and 1 Gbyte of memory. Due
to space considerations we only include 6 test meshes but they have been chosen to be
a representative sample of medium to large scale real-life problems and include both 2d
and 3d examples. Table 1 gives a list of the meshes and their sizes in terms of the number
of vertices and edges. The table also shows the aspect ratio of each entire mesh and the
mesh grading, which here we define as the maximum surface of any element over the
minimum surface, and these two figures give a guide as to how difficult the optimisation
may be. For example, 'uk' is simply a triangulation of the British mainland and hence
has a very intricate boundary and therefore a high aspect ratio. Meanwhile, 'dime20',
which has a moderate aspect ratio, has been very heavily refined in parts and thus has
a high mesh grading - the largest element has a surface around 3,700 times larger than
that of the smallest.

Table 2. Final results using template cost matching and surface gain/template cost optimisation

              P = 16                P = 32                P = 64                P = 128
mesh      Γ_t   |E_c|   t_s     Γ_t   |E_c|   t_s     Γ_t   |E_c|   t_s     Γ_t   |E_c|   t_s
uk        1.48    206   0.12    1.31    331   0.12    1.23    543   0.22    1.25    917   0.50
t60k      1.16   1003   1.63    1.10   1547   2.07    1.11   2437   2.33    1.11   3647   2.65
dime20    1.22   1623   5.78    1.20   2868   5.17    1.15   4406   5.70    1.12   6620   7.57
cs4       1.22   2727   0.85    1.22   3738   0.90    1.23   5066   1.12    1.23   6747   1.60
mesh100   1.25   5950   3.20    1.24   8752   3.53    1.26  12467   4.13    1.28  17346   5.13
cyl3      1.21  11141  10.05    1.21  15944  10.77    1.23  22378  13.02    1.22  29719  13.18

Table 2 shows the results of the final combination of algorithms - TCM (see §3.2)
and SGTC (see §4.3) - which were chosen as a benchmark for the other combinations.
For the 4 different values of $P$ (the number of subdomains), the table shows the average
aspect ratio as given by $\Gamma_t$, the edge cut $|E_c|$ (that is the number of cut edges, not the
weight of cut edges weighted by surface size) and the time in seconds, $t_s$, to partition
the mesh. Notice that with the exception of the 'uk' mesh, all partitions have average
aspect ratios of less than 1.30 which is well within the target range suggested in [6].
Indeed for the 'uk' mesh it is no surprise that the results are not optimal because the
subdomains inherit some of the poor AR from the original mesh (which has an AR of
3.39) and it is only when the mesh is split into small enough pieces, $P$ = 64 or 128, that
the optimisation succeeds in ameliorating this effect. Intuitively this also gives a hint as
to why DD methods are a very successful technique as a solver.


3  The  multilevel  paradigm 

In recent years it has been recognised that an effective way of both speeding up partition
refinement and, perhaps more importantly, giving it a global perspective is to use
multilevel techniques. The idea is to match pairs of vertices to form clusters, use the clusters to
define a new graph and recursively iterate this procedure until the graph size falls below
some threshold. The coarsest graph is then partitioned and the partition is successively
optimised on all the graphs starting with the coarsest and ending with the original. This
sequence of contraction followed by repeated expansion/optimisation loops is known as
the multilevel paradigm and has been successfully developed as a strategy for overcoming
the localised nature of the KL (and other) optimisation algorithms. The multilevel
idea was first proposed by Barnard & Simon [1] as a method of speeding up spectral
bisection and improved by Hendrickson & Leland [11], who generalised it to encompass
local refinement algorithms. Several algorithms for carrying out the matching have
been devised by Karypis & Kumar [13], while Walshaw & Cross describe a method for
utilising imbalance in the coarsest graphs to enhance the final partition quality [19].


3.1  Implementation 

Graph contraction. To create a coarser graph $G_{l+1}(V_{l+1}, E_{l+1})$ from $G_l(V_l, E_l)$ we
use a variant of the edge contraction algorithm proposed by Hendrickson & Leland
[11]. The idea is to find a maximal independent subset of graph edges, or a matching
of vertices, and then collapse them. The set is independent because no two edges in
the set are incident on the same vertex (so no two edges in the set are adjacent), and
maximal because no more edges can be added to the set without breaking the independence
criterion. Having found such a set, each selected edge is collapsed and the vertices,
$v_1, v_2 \in V_l$ say, at either end of it are merged to form a new vertex $v \in V_{l+1}$ with weight
$|v| = |v_1| + |v_2|$.
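
The contraction step can be illustrated with a greedy construction of a maximal matching followed by the merging of vertex weights. This is a minimal sketch under simplifying assumptions (an adjacency-list graph without self-loops, random choice of neighbour, hypothetical function names); it is not the variant of the Hendrickson & Leland algorithm used in the implementation.

```python
import random

def maximal_matching(adjacency, rng):
    """Greedy maximal matching on {vertex: [neighbours]}.

    Vertices are visited once in random order; each unmatched vertex is paired
    with a randomly chosen unmatched neighbour, if one exists.
    Returns a symmetric dict: matched[u] == v and matched[v] == u.
    """
    order = list(adjacency)
    rng.shuffle(order)
    matched = {}
    for v in order:
        if v in matched:
            continue
        free = [u for u in adjacency[v] if u not in matched]
        if free:
            u = rng.choice(free)
            matched[v] = u
            matched[u] = v
    return matched

def contract_weights(weights, matched):
    """Merge matched pairs: the coarse vertex weight is |v1| + |v2|."""
    coarse = {}
    for v, w in weights.items():
        cluster = frozenset((v, matched[v])) if v in matched else frozenset((v,))
        coarse[cluster] = coarse.get(cluster, 0) + w
    return coarse
```

The result is independent because a vertex is never matched twice, and maximal because a vertex is left unmatched only when all of its neighbours were already matched at the time it was visited. The total vertex weight is preserved by the contraction.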

The initial partition. Having constructed the series of graphs until the number of
vertices in the coarsest graph is smaller than some threshold, the normal practice of the
multilevel strategy is to carry out an initial partition. Here, following the idea of Gupta
[10], we contract until the number of vertices in the coarsest graph is the same as the
number of subdomains, $P$, and then simply assign vertex $i$ to subdomain $S_i$. Unlike
Gupta, however, we do not carry out repeated expansion/contraction cycles of the coarsest
graphs to find a well balanced initial partition but instead, since our optimisation
algorithm incorporates balancing, we commence on the expansion/optimisation sequence
immediately.

Partition expansion. Having optimised the partition on a graph $G_l$, the partition
must be interpolated onto its parent $G_{l-1}$. The interpolation itself is a trivial matter: if
a vertex $v \in V_l$ is in subdomain $S_p$ then the matched pair of vertices that it represents,
$v_1, v_2 \in V_{l-1}$, will be in $S_p$.

3.2  Incorporating  aspect  ratio 

The matching part of the multilevel strategy can be easily modified in several ways to
take into account AR and in each case the vertices are visited (at most once) using a
randomly ordered linked list. Each vertex is then matched with an unmatched neighbour
using the chosen matching algorithm and it and its match removed from the list. Vertices
with no unmatched neighbours remain unmatched and are also removed. In addition to
Random Matching (RM) [12], where vertices are matched with random neighbours,
we propose and have tested 3 matching algorithms:

Surface Matching (SM). As we have seen in §2.2, the AR partitioning problem can
be approximated by the cut-edge weight problem using (4), the $\Gamma_s$ cost function, and
so the simplest matching is to use the Heavy Edge approach of Karypis & Kumar [13],
where the vertex matches across the heaviest edge to any of its unmatched neighbours.
This is the same as matching across the largest surface (since here edge weights represent
surfaces) and we refer to this as surface matching.

Template Cost Matching (TCM). A second approach follows the ideas of Bouhmala
[3], and matches with the neighbour which minimises the cost function. In this
case, the chosen vertex matches with the unmatched neighbour which gives the resulting
element the best aspect ratio. Using the $\Gamma_t$ cost function, we refer to this as template
cost matching.

Surface Cost Matching (SCM). This is the same idea as TCM only using the $\Gamma_s$
cost function, (4), which is faster to calculate.
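
The three matching rules differ only in how a vertex ranks its unmatched neighbours, which the following sketch tries to make explicit. The function `choose_match` and its arguments are hypothetical; in particular, interpreting SCM as "minimise the surface of the merged element" is an assumption, since under $\Gamma_s$ the volume term is treated as constant.

```python
def choose_match(v, candidates, volume, surface, edge_weight, dim, method):
    """Pick a match for v among its unmatched neighbours `candidates`.

    volume, surface: {vertex: value}
    edge_weight:     {frozenset((u, v)): shared surface}
    method:          "SM", "TCM" or "SCM"
    """
    def merged_surface(u):
        # Surface of the combined element: sum less twice the shared surface.
        return surface[v] + surface[u] - 2.0 * edge_weight[frozenset((v, u))]

    def merged_ar(u):
        # AR of the combined element up to the constant factor 1 / (2 * dim).
        return merged_surface(u) / (volume[v] + volume[u]) ** ((dim - 1) / dim)

    if not candidates:
        return None                      # v stays unmatched
    if method == "SM":                   # heaviest edge = largest shared surface
        return max(candidates, key=lambda u: edge_weight[frozenset((v, u))])
    if method == "TCM":                  # best aspect ratio of merged element
        return min(candidates, key=merged_ar)
    if method == "SCM":                  # assumed: smallest merged surface
        return min(candidates, key=merged_surface)
    raise ValueError("unknown method: " + method)
```

As an example, take a unit square `v` with two neighbours: a 1 x 5 strip sharing a side of length 1, and a 0.9 x 0.9 square sharing a side of length 0.9. SM picks the strip (heavier edge), whereas TCM and SCM both pick the small square, because merging with the strip would create a long thin 1 x 6 element.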


3.3  Results  for  different  matching  functions 

In Tables 3, 4 & 5 we compare the results in Table 2, where TCM was used, with RM, SM
& SCM respectively. In all cases the SGTC optimisation algorithm (see §4.3) was used.
For each value of $P$, the first column shows the average AR, $\Gamma_t$, of the partitioning. The
second column for each value of $P$ then compares results with those in Table 2 using the
metric $(\Gamma_t(\mathrm{RM}) - 1) / (\Gamma_t(\mathrm{TCM}) - 1)$ for RM, etc. Thus a figure > 1 means that RM has
produced worse results than TCM. These comparisons are then averaged and so it can be
seen, e.g. for $P$ = 16, that RM produces results 24% (1.24) worse on average than TCM.
Indeed the average quality of partitions produced by RM was 30% worse than TCM. This is not
altogether surprising since the AR of elements in the coarsest graph could be very poor
if the matching takes no account of it, and hence the optimisation has to work with badly
shaped elements.
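
The comparison metric measures how far each average AR lies above the ideal value of 1, relative to the TCM benchmark. A short Python sketch (the function name `relative_quality` is illustrative only) reproduces two of the Table 4 entries from the rounded values of $\Gamma_t$ printed in Tables 2 and 4:

```python
def relative_quality(gamma_other, gamma_tcm):
    """Excess of an average AR over the ideal value 1, relative to TCM.

    Values above 1 mean the other matching produced worse partitions than TCM.
    """
    return (gamma_other - 1.0) / (gamma_tcm - 1.0)

# 'uk' at P = 16: SM gives 1.54, TCM gives 1.48.
uk_ratio = relative_quality(1.54, 1.48)
# 't60k' at P = 16: SM gives 1.14, TCM gives 1.16.
t60k_ratio = relative_quality(1.14, 1.16)
```

These evaluate to approximately 1.125 and 0.875, which agree with the tabulated 1.13 and 0.87 to within the rounding of the published values of $\Gamma_t$.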


Table 3. Random matching results compared with template cost matching

When it comes to comparing TCM with SM & SCM (Tables 4 & 5) there is actually
very little difference; SM is about 3.5% worse and SCM only about 1.5%. This suggests
that the multilevel strategy is relatively robust to the matching algorithm provided the
AR is taken into account in some way.


Table 4. Surface matching results compared with template cost matching
(ratio denotes (Γ_t(SM) - 1) / (Γ_t(TCM) - 1))

             P = 16          P = 32          P = 64          P = 128
mesh       Γ_t   ratio     Γ_t   ratio     Γ_t   ratio     Γ_t   ratio
uk         1.54   1.13     1.34   1.11     1.24   1.01     1.28   1.10
t60k       1.14   0.87     1.11   1.05     1.12   1.10     1.12   1.08
dime20     1.26   1.18     1.24   1.23     1.15   1.00     1.13   1.04
cs4        1.22   0.97     1.24   1.08     1.24   1.04     1.23   1.00
mesh100    1.20   0.78     1.24   1.03     1.27   1.04     1.26   0.94
cyl3       1.19   0.93     1.21   1.02     1.24   1.05     1.24   1.08
Average           0.98            1.08            1.04            1.04

Table 5. Surface cost matching results compared with template cost matching

We are not primarily concerned with partitioning times here, but for the record, RM
was about 0.5% slower than TCM (although this is well within the limits of noise). This
is because the optimisation stage took considerably longer (although the matching was
much faster than TCM). SM & SCM were 3.3% & 1.8% faster respectively than TCM.
Overall this suggests that TCM is the algorithm of choice although there is little benefit
over SM & SCM.

4  The  Kernighan-Lin  optimisation  algorithm 

In this section we discuss the key features of an optimisation algorithm, fully described
in [19], and then in §4.3 describe how it can be modified to optimise for AR. It is a
Kernighan-Lin (KL) type algorithm incorporating a hill-climbing mechanism to enable
it to escape from local minima. The algorithm uses bucket sorting (§4.4), the linear time
complexity improvement of Fiduccia & Mattheyses [9], and is a partition optimisation
formulation; in other words it optimises a partition of $P$ subdomains rather than a
bisection.


4.1  The  gain  function 

A key concept in the method is the idea of gain. The gain $g(v, q)$ of a vertex $v$ in
subdomain $S_p$ can be calculated for every other subdomain, $S_q$, $q \ne p$, and expresses how
much the cost of a given partition would be improved were $v$ to migrate to $S_q$. Thus,
if $\pi$ denotes the current partition and $\pi'$ the partition if $v$ migrates to $S_q$ then for a cost
function $\Gamma$, the gain $g(v, q) = \Gamma(\pi') - \Gamma(\pi)$. Assuming the migration of $v$ only affects
the cost of $S_p$ and $S_q$ (as is true for $\Gamma_t$ and $\Gamma_s$) then we get

$$g(v, q) = \mathrm{AR}(S_q + v) - \mathrm{AR}(S_q) + \mathrm{AR}(S_p - v) - \mathrm{AR}(S_p). \qquad (5)$$

For $\Gamma_t$ this gives an expression which cannot be further simplified; however, for $\Gamma_s$,
since

$$\mathrm{AR}(S_q + v) - \mathrm{AR}(S_q) = \frac{1}{K_1} \left\{ \partial^i (S_q + v) - \partial^i S_q \right\}
= \frac{1}{K_1} \left\{ \partial^i S_q + \partial^i v - 2|(S_q, v)| - \partial^i S_q \right\}$$

(where $|(S_q, v)|$ denotes the sum of edge weights between $S_q$ and $v$), we get

$$g_{surface}(v, q) = \frac{2}{K_1} \left\{ |(S_p, v)| - |(S_q, v)| \right\}. \qquad (6)$$

Notice in particular that $g_{surface}$ is the same as the cut-edge weight gain function and that it
is entirely localised, i.e. the gain of a vertex only depends on the length of its boundaries
with a subdomain and not on any intrinsic qualities of the subdomain which could be
changed by non-local migration.
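
The localised nature of the surface gain can be checked numerically. The sketch below follows the sign convention of (5), under which the gain is the cost after migration minus the cost before, so a negative value corresponds to a reduction in cost. The closed form used in `surface_gain` is our reading of the surface-cost case of (5); the function names, the dict-of-dicts graph and the omission of the constant $K_2$ are all assumptions for illustration.

```python
def surface_gain(v, q, part, edges, k1):
    """Change in surface cost if v migrates from its subdomain p to q.

    edges: symmetric {u: {w: weight}}; part: {vertex: subdomain}.
    Only the edges incident on v are inspected.
    """
    p = part[v]
    to_p = sum(w for u, w in edges[v].items() if part[u] == p)
    to_q = sum(w for u, w in edges[v].items() if part[u] == q)
    return 2.0 * (to_p - to_q) / k1

def interior_surface_cost(part, edges, k1):
    """Sum over subdomains of interior surface / K1 (constant K2 omitted).

    Every cut edge is counted twice, once for each subdomain it borders.
    """
    cut = sum(w for v in edges for u, w in edges[v].items() if part[u] != part[v])
    return cut / k1
```

On a path graph 0 - 1 - 2 - 3 with edge weights 1.5, 0.5 and 2.0, partitioned as {0, 1} and {2, 3}, moving vertex 1 across gives a gain of 2(1.5 - 0.5)/K1, which matches the difference in `interior_surface_cost` before and after the move.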

4.2  The  iterative  optimisation  algorithm 

The serial optimisation algorithm, as is typical for KL type algorithms, has inner and
outer iterative loops with the outer loop terminating when no migration takes place during
an inner loop. The optimisation uses two bucket sorting structures or bucket trees
(see below, §4.4) and is initialised by calculating the gain for all border vertices and
inserting them into one of the bucket trees. These vertices will subsequently be referred to
as candidate vertices and the tree containing them as the candidate tree.

The inner loop proceeds by examining candidate vertices, highest gain first (by always
picking vertices from the highest ranked bucket), testing whether the vertex is
acceptable for migration and then transferring it to the other bucket tree (the tree of
examined vertices). This inner loop terminates when the candidate tree is empty although it
may terminate early if the partition cost (i.e. the number of cut edges) rises too far above
the cost of the best partition found so far. Once the inner loop has terminated any vertices
remaining in the candidate tree are transferred to the examined tree and finally pointers
to the two trees are swapped ready for the next pass through the inner loop.

The algorithm also uses a KL type hill-climbing strategy; in other words vertex
migration from subdomain to subdomain can be accepted even if it degrades the partition
quality and later, based on the subsequent evolution of the partition, either rejected
or confirmed. During each pass through the inner loop, a record of the optimal partition
achieved by migration within that loop is maintained together with a list of vertices
which have migrated since that value was attained. If subsequent migration finds a 'better'
partition then the migration is confirmed and the list is reset. Once the inner loop
is terminated, any vertices remaining in the list (vertices whose migration has not been
confirmed) are migrated back to the subdomains they came from when the optimal cost
was attained.

The algorithm, together with conditions for vertex migration acceptance and confirmation,
is fully described in [19].
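
The hill-climbing mechanism (tentative migration, confirmation when a better partition is found, and rollback of unconfirmed moves) can be sketched in a few lines. This is a deliberately simplified illustration: the candidate moves are supplied as a pre-ranked list instead of being drawn from bucket trees, the balance and acceptance conditions of [19] are omitted, and the name `inner_pass` and the `tolerance` parameter are assumptions.

```python
def inner_pass(part, candidates, cost, tolerance):
    """One simplified pass of the inner loop; modifies `part` in place.

    part:       {vertex: subdomain}
    candidates: list of (vertex, target subdomain), highest rank first
    cost:       function mapping a partition dict to a number
    tolerance:  stop early once the cost exceeds best cost + tolerance
    Returns the best cost found; `part` is left at the partition attaining it.
    """
    best_cost = cost(part)
    unconfirmed = []                  # (vertex, old subdomain) since the best
    for v, q in candidates:
        if part[v] == q:
            continue
        unconfirmed.append((v, part[v]))
        part[v] = q                   # accept the move, even if it is uphill
        current = cost(part)
        if current < best_cost:
            best_cost = current
            unconfirmed.clear()       # a better partition confirms the moves
        elif current > best_cost + tolerance:
            break                     # cost has risen too far: end the pass
    for v, old in reversed(unconfirmed):
        part[v] = old                 # roll back unconfirmed migrations
    return best_cost
```

In a small example where moving one vertex alone raises the cut weight from 4 to 5 but moving its neighbour as well reduces it to 0, the pass accepts the first (uphill) move and both moves are confirmed by the second. If the second move is not offered, the first is rolled back.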


4.3  Incorporating  aspect  ratio:  localisation 

One of the advantages of using cut-edge weight as a cost function is its localised nature.
When a graph vertex migrates from one subdomain to another, only the gains of adjacent
vertices are affected. In contrast, when using the graph to optimise AR, if a vertex $v$
migrates from $S_p$ to $S_q$, the volume and surface of both subdomains will change. This in
turn means that, when using the template cost function (2), the gain of all border vertices
both within and abutting subdomains $S_p$ and $S_q$ will change. Strictly speaking, all these
gains should be adjusted with the huge disadvantage that this may involve thousands of
floating point operations and hence be prohibitively expensive. As an alternative,
therefore, we propose two localised variants:

Surface Gain/Surface Cost (SGSC). The simplest way to localise the updating of
the gains is to make the assumption in §2.1 that the subdomains all have approximately
equal volume and to use the surface cost function $\Gamma_s$ from (4). As mentioned in §2.2 the
problem immediately reduces to the cut-edge weight problem, albeit with non-integer
edge weights, and from (6) only the gains of the vertices adjacent to the migrating vertex
will need updating. However, if this assumption is not true, it is not clear how well $\Gamma_s$
will optimise the AR and below we provide some experimental results.

Surface Gain/Template Cost (SGTC). The second method we propose for localising
the updates of gain relies on the observation that the gain is simply used as a method
of rating the elements so that the algorithm always visits those with highest gain first
(using the bucket sort). It is not clear how crucial this rating is to the success of the
algorithm and indeed Karypis & Kumar demonstrated that (at least when optimising for
cut-edge weight) almost as good results can be achieved by simply visiting the vertices
in random order [14]. We therefore propose approximating the gain with the surface cost
function $\Gamma_s$ from (4) to rate the elements and store them in the bucket tree structure, but
using the template cost function $\Gamma_t$ from (2) to assess the change in cost when actually
migrating an element. This localises the gain function.
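
The division of labour in SGTC, a cheap local rating for ordering and the template cost for assessing an actual migration, might be sketched as follows. Both function names are hypothetical, `surface_rank` is expressed as the reduction in cut surface so that larger values are visited first, and the sketch assumes $S_p$ contains more than the migrating vertex.

```python
def surface_rank(w_to_p, w_to_q):
    """Local rating of a move from S_p to S_q: reduction in cut surface.

    w_to_p, w_to_q: total edge weight from the vertex to S_p and to S_q.
    """
    return w_to_q - w_to_p

def template_delta(sub_p, sub_q, vertex, w_to_p, w_to_q, dim, num_parts):
    """Change in the template cost (2) if the vertex moves from S_p to S_q.

    sub_p, sub_q, vertex: (surface, volume) pairs before the move.
    A negative result means the average aspect ratio improves.
    """
    exponent = (dim - 1) / dim

    def ar(surface, volume):
        return surface / volume ** exponent

    (sp, vp), (sq, vq), (sv, vv) = sub_p, sub_q, vertex
    # Removing v exposes its shared faces with S_p; adding v hides those with S_q.
    new_p = (sp - sv + 2.0 * w_to_p, vp - vv)
    new_q = (sq + sv - 2.0 * w_to_q, vq + vv)
    change = ar(*new_p) + ar(*new_q) - ar(sp, vp) - ar(sq, vq)
    return change / (2 * dim * num_parts)
```

Consider four unit squares in a row, with the first three forming $S_p$ and the last forming $S_q$. Moving the third square across has a surface rating of 0 (it has one neighbour on each side), yet the template cost falls, since a 3 x 1 strip and a unit square are replaced by two 2 x 1 rectangles. This illustrates why the template cost is consulted when the move is actually made.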

4.4  Incorporating  aspect  ratio:  bucket  sorting  with  non-integer  gains 

The bucket sort is an essential tool for the efficient and rapid sorting and adjustment of
vertices by their gain. The concept was first suggested by Fiduccia & Mattheyses in [9]
and the idea is that all vertices of a given gain g are placed together in a ‘bucket’ which
is ranked g. Finding a vertex with maximum gain then simply consists of finding the
(non-empty) bucket with the highest rank and picking a vertex from it. If the vertex is
subsequently migrated from one subdomain to another then the gains of any affected
vertices have to be adjusted and the list of vertices which are candidates for migration
re-sorted by gain. Using a bucket sort for this operation simply requires recalculating the
gains and transferring the affected vertices to the appropriate buckets. If a bucket sort
were not used and, say, the vertices were simply stored in a list in gain order, then the
entire list would require re-sorting (or at least merge-sorting with the sorted list of
adjusted vertices), an essentially O(N) operation for every migration.
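
The bucket idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of [19]: the class name `GainBuckets` and its methods are hypothetical, gains are integers, and the highest bucket is found with a plain `max` over the non-empty ranks rather than with the tree structure used in the paper.

```python
from collections import defaultdict


class GainBuckets:
    """Vertices grouped by integer gain; the bucket rank is the gain itself."""

    def __init__(self):
        self.buckets = defaultdict(set)  # gain -> vertices with that gain
        self.gain_of = {}                # vertex -> current gain

    def insert(self, vertex, gain):
        self.buckets[gain].add(vertex)
        self.gain_of[vertex] = gain

    def remove(self, vertex):
        gain = self.gain_of.pop(vertex)
        self.buckets[gain].discard(vertex)
        if not self.buckets[gain]:
            del self.buckets[gain]       # keep only non-empty buckets

    def update(self, vertex, new_gain):
        """Move one affected vertex to its new bucket; no global re-sort."""
        self.remove(vertex)
        self.insert(vertex, new_gain)

    def pop_max(self):
        """Pick any vertex from the highest-ranked non-empty bucket."""
        best = max(self.buckets)         # simple scan over non-empty ranks
        vertex = next(iter(self.buckets[best]))
        self.remove(vertex)
        return vertex, best


b = GainBuckets()
for name, gain in [("a", 2), ("b", -1), ("d", 0)]:
    b.insert(name, gain)
b.update("d", 3)          # only the affected vertex changes bucket
print(b.pop_max())        # ('d', 3)
```

After a migration only the vertices whose gains changed are passed to `update`, which is the point of the structure: the cost depends on the number of affected vertices, not on N.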

The implementation of the bucket sort is fully described in [19]. It includes a ranking
for prioritising vertices for migration which incorporates their weight as well as their
gain. The non-empty buckets are stored in a binary tree to save excessive memory use
(since we do not know a priori how many buckets will be needed) and this structure is
referred to above as a bucket tree.

The only difficulty in adapting this procedure to AR optimisation is that with
non-integer edge weights, the gains are also real non-integer numbers. This is not a major
problem in itself as we can just give buckets an interval of gains rather than a single
integer, i.e. the bucket ranked 1 could contain any vertex with gain in the interval [1.0, 2.0).
However, if using the surface gain function, the issue of scaling then arises since for a
mesh entirely contained within the unit square/cube, all the vertices are likely to end up
in one of two buckets (dependent only on whether they have positive or negative gains).
Fortunately, if using $\Gamma_s$ as a gain function, as in SGSC and SGTC, we can easily
calculate the maximum possible gain. This would occur if the vertex with the largest surface,
$v \in S_p$ say, were entirely surrounded by neighbours in $S_q$. The maximum possible gain
is then $2 \max_{v \in V} \partial v$ (strictly speaking $2 \max_{v \in V} \partial' v$) and
similarly the minimum gain is $-2 \max_{v \in V} \partial v$. This means we can easily choose
the number of buckets and scale the gain accordingly. A problem still arises for meshes with
a high grading because many of the elements will have an insignificant surface area compared
to the maximum. However the experiments carried out here all used a scaling which allowed a
maximum of 100 buckets and we have tested the algorithm with up to 10,000 buckets without
significant penalty in terms of either memory or run-time.
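
The scaling step can be illustrated with a short sketch. It assumes that gains lie in the interval [-max_gain, max_gain], where max_gain stands for the bound of twice the largest vertex surface given in the text. The function name `bucket_rank` and the choice of equal-width intervals are assumptions made for illustration only.

```python
import math


def bucket_rank(gain, max_gain, n_buckets=100):
    """Map a real gain in [-max_gain, max_gain] to an integer bucket rank.

    Ranks run from 0 (most negative gains) to n_buckets - 1 (largest gains).
    """
    if max_gain <= 0:
        raise ValueError("max_gain must be positive")
    width = 2.0 * max_gain / n_buckets           # gain interval covered by one bucket
    rank = math.floor((gain + max_gain) / width)
    return min(max(rank, 0), n_buckets - 1)      # gain == max_gain falls in the top bucket


# Without scaling, unit-width buckets separate small gains only by sign.
small_gains = [0.013, -0.004, 0.0007, -0.019]
print(sorted({math.floor(g) for g in small_gains}))          # [-1, 0]

# With scaling, the same kind of spread is resolved across the buckets.
print([bucket_rank(g, max_gain=1.0, n_buckets=4) for g in (-1.0, -0.3, 0.0, 0.6, 1.0)])
# [0, 1, 2, 3, 3]
```

For a highly graded mesh most gains are tiny compared with max_gain, so they still fall into the few buckets around zero, which is the residual problem noted in the text.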

4.5  Results for different optimisation functions

Table 6 compares SGSC against the SGTC results in Table 2. Both sets of results use
template cost matching (TCM). The table is in the same form as those in §3.3 and shows
that there is on average only a tiny difference between the two (SGTC is 0.5% better than
SGSC) and again, with the exception of the ‘uk’ mesh for P = 16 & 32, all results have
an average AR of less than 1.30. The implication of this table is that the assumption
made in §2.1, that all subdomains have approximately the same volume, is reasonably
good. However this assumption is not necessarily true because, for example, for P = 128
the ‘dime20’ mesh, with its high grading, has a ratio of
$\max \Omega S_p / \min \Omega S_p = 2723$. A possible explanation is that although the
assumption is false globally, it is true locally, since the mesh density does not change
too rapidly (as should be the case with most meshes generated by adaptive refinement)
and so the volume of each subdomain is approximately equal to that of its neighbours.
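
The quoted 0.5% figure can be checked from the "Average" row of Table 6. The short calculation below assumes that the second column for each P is the ratio (Γt(SGSC) − 1) / (Γt(SGTC) − 1), so that a value above 1 means SGSC is further from the ideal aspect ratio of 1 than SGTC.

```python
# Per-P averages of the ratio column, copied from the "Average" row of Table 6.
ratio_averages = {16: 0.99, 32: 1.01, 64: 1.04, 128: 0.98}

overall = sum(ratio_averages.values()) / len(ratio_averages)
print(f"overall ratio = {overall:.3f}")
print(f"SGSC excess over SGTC = {100 * (overall - 1):.1f}%")
```

Under that reading the four averages combine to an overall ratio of about 1.005, which matches the statement that SGTC is roughly 0.5% better than SGSC.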


              P = 16          P = 32          P = 64          P = 128
mesh        Γt    ratio     Γt    ratio     Γt    ratio     Γt    ratio
uk         1.49   1.02     1.32   1.05     1.24   1.02     1.23   0.92
t60k       1.15   0.95     1.10   0.96     1.12   1.07     1.12   1.11
dime20     1.23   1.03     1.17   0.86     1.15   0.98     1.11   0.91
cs4        1.20   0.90     1.23   1.05     1.24   1.03     1.22   0.97
mesh100    1.24   0.95     1.26   1.10     1.27   1.06     1.27   0.97
cyl3       1.23   1.10     1.22   1.08     1.24   1.06     1.22   1.00
Average           0.99            1.01            1.04            0.98

ratio = (Γt(SGSC) − 1) / (Γt(SGTC) − 1)

Again we are not primarily concerned with partitioning times, but it was surprising
to see that SGSC was on average 30% slower than SGTC. A possible explanation is
that although the cost function $\Gamma_s$ is a good approximation, $\Gamma_t$ is a more
global function and so the optimisation converges more quickly.


5  Discussion 


5.1  Comparison  with  cut-edge  weight  partitioning 

In Table 7 we compare AR as produced by the edge cut partitioner (EC) described in
[19] with the results in Table 2. On average AR partitioning produces results which are
16% better than those of the edge cut partitioner (as could be expected). However, for
the mesh ‘cs4’ EC partitioning is consistently better and this is a subject for further
investigation.


Table 7. AR results for the edge cut partitioner compared with the AR partitioner

              P = 16          P = 32          P = 64          P = 128
mesh        Γt    ratio     Γt    ratio     Γt    ratio     Γt    ratio
uk         1.52    --       --    1.07     1.26   1.09     1.28   1.14
t60k       1.19    --       --    1.76     1.17   1.47     1.17   1.55
dime20     1.32    --       --    1.34     1.25   1.65     1.21   1.72
cs4        1.19    --       --    0.93     1.20   0.87     1.21   0.92
mesh100    1.22    --       --    0.91     1.26   1.03     1.24   0.86
cyl3       1.22    --       --    1.09     1.23   1.00     1.23   1.02
Average            --             1.18            1.19            1.20

ratio = (Γt(EC) − 1) / (Γt(AR) − 1); -- marks an entry that is illegible in the source

Meanwhile in Table 8 we compare the edge cut produced by the EC partitioner with
that of the AR partitioner. Again as expected, EC partitioning produces the best results
(about 11% better than AR). In terms of time, the EC partitioner is about 26% faster than
AR on average. Again this is no surprise since the AR partitioning involves floating point
operations (assessing cost and combining elements) while EC partitioning only requires
integer operations.

Table 8. $|E_c|$ results for the edge cut partitioner compared with the AR partitioner

              P = 16          P = 32          P = 64          P = 128
mesh       |Ec|   ratio    |Ec|   ratio    |Ec|   ratio    |Ec|   ratio
cs4        2343   0.86     3351   0.90     4534   0.89     6101   0.90
mesh100    4577   0.77     7109   0.81    10740   0.86    14313   0.83
cyl3      10458   0.94    14986   0.94    20765   0.93    27869   0.94
Average           0.88            0.89            0.90            0.90

(The column headings and the rows preceding cs4 are missing from the source; the only
surviving fragment of those rows is: 2294  0.80  3637.)

5.2  Generic  multilevel  mesh  partitioning 

In this paper we have adapted a mesh partitioning technique originally designed to solve
the edge cut partitioning problem to a different cost function. The question then arises,
is the multilevel strategy an appropriate technique for solving partitioning problems (or
indeed other optimisation problems) with different cost functions? Clearly this is an
impossible question to answer in general but a few pertinent remarks can be made:

-  For the AR based cost functions at least, the method seems relatively sensitive to
whether the cost is included in the matching. This suggests that, if possible, a generic
multilevel partitioner should use the cost function to minimise the cost of the
matchings. Note, however, that this may not be possible as a cost function which, say,
measured the cost of a mapping onto a particular processor topology would be unable
to function since at the matching stage no partition, and hence no mapping, exists.

-  The optimisation relies, for efficiency at least, on having a local gain function in
order that the migration of a vertex does not involve an O(N) update. Here we were
able to localise the cost function by making a simple approximation to give a local
gain function; however, it is not clear that this is always possible.

-  The bucket sort is reasonably simple to convert to non-integer gains; however, this
relies on being able to estimate the maximum gain. If this is not possible it may not
be easy to generate a good scaling which separates vertices of different gains into
different buckets.

5.3  Conclusion  and  future  research 

We have shown that the multilevel strategy can be modified to optimise for aspect
ratio. To fully validate the method, however, we need to demonstrate that the measure of
aspect ratio used here does indeed provide the benefits for DD preconditioners that the
theoretical results suggest. It is also desirable to measure the correlation between aspect
ratio and convergence in the solver.

Also, although parallel implementations of the multilevel strategy do exist, e.g. [20],
it is not clear how well AR optimisation, with its more global cost function, will work in
parallel and this is another direction for future research. Some related work already
exists in the context of a parallel dynamic adaptive mesh environment [5, 6, 16], but these
are not multilevel methods and it was necessary to use a combination of several complex
cost functions in order to achieve reasonable results, so the question arises whether
multilevel techniques can help to overcome this.


Abstract. The HPF-Builder graphical environment provides an interactive
and visual solution for editing and visualizing HPF data mapping directives.
It frees HPF programmers from all syntactic constraints. General and
detailed visualizations give complete information about data distribution
along the grids of processors.

Comparing several mappings requires evaluating statistics about load
distribution and communications. This paper introduces an evolution of
HPF-Builder which produces such statistics and provides a graphical
way to visualize them.


1  Introduction 

With the emergence of parallel and massively parallel machines and of clusters
of communicating computers, where the memory is physically distributed over a
large number of processors, new parallel programming techniques have appeared.

With the data parallel model, the program is replicated over all the processors,
and vectors or matrices are distributed across them, parallel operations being
processed simultaneously by each processor.

Data parallelism is well suited to scientific computing: algorithms have to
deal with large regular data structures (vectors, matrices), and the same
treatment has to be applied to each item of the structures.

The expression of parallelism at the data level has the advantage of maintaining
a single control flow. A data parallel algorithm consists of a sequence of
elementary instructions applied to scalar or parallel data.

As Fortran is the standard language for scientific computing, Fortran 90,
a data parallel extension, has been developed. It allows programmers to benefit
from the data parallel model without having to rewrite their codes in a completely
new language.

Fortran 90 promotes arrays as global parallel entities. It supports array
expressions and provides restructuring operations on them (gather, scatter,
reductions ...).

The compilation for distributed memory machines relies on the notion of
data distribution by the use of mapping directives. These directives specify sets
of elementary data that should be allocated on the same processor. HPF (High
Performance Fortran) [6,7] is an example of this approach and seems to be
becoming the most popular language for data parallel scientific programming.

*  tel.: +33-3 20 43 47 30, fax.: +33-3 20 43 65 66, e-mail: lefebvre@lifl.fr

A distributed data parallel algorithm designer usually starts from a Fortran 90
code and inserts HPF directives respecting the HPF syntactic rules.
The Fortran 90 parts express the data parallel algorithm itself and the HPF
directives ensure the mapping of the data without semantic contribution. The
effects of these directives are essential in balancing parallel processing against
communications. The programmer has to insert all these mapping directives
by hand. Therefore the scientific programmer must learn a third generation
dialect of Fortran to take advantage of parallel machines.

Furthermore, programmers have to evaluate the quality of their mappings
themselves.

Like Fortran 90, HPF supports regular data structures (multi-dimensional
arrays). Furthermore, HPF provides geometrical support for expressing the
distribution of data among grids of abstract processors.

The expression of parallelism at the data level allows the programmer to
have a visual perception of the distribution of data in space (at least for 1, 2
and 3-dimensional arrays and grids). Programmers often use paper and coloured
pencils to draw and improve their mapping before translating the drawing into
HPF directives.

The first goal of the HPF-Builder project [5] is to provide a tool to help
the programmer at this level. It proposes to replace the paper and pencils by a
screen and a mouse. Then it automatically generates the HPF directives from
the drawing.

The HPF-Builder graphical environment frees the programmer from all the
syntactic constraints due to the data mapping. Furthermore, it verifies the
coherence of mappings and avoids errors (like shifts in indices) during the phase
of translation from drawing to HPF code.

HPF-Builder respects the hierarchical HPF programming model (arrays
aligned together, or with templates, and their distribution onto virtual processor
grids). For each level, HPF-Builder provides a graphical interactive editor.
In a WYSIWYG way, each editor is able to generate the appropriate directives
according to the programmer's data manipulation.

2  The  hierarchical  HPF  programming  model 

A complete use of HPF directives follows a three-level hierarchical approach
(see figure 1).

At the first level are the arrays. For each operation in the code involving data
parallel handling, remote accesses imply communications. In order to minimize this
overhead, programmers need to specify how each part of an array has to be placed
relative to the others. HPF alignment directives implement these specifications.

The second level is the template, with which arrays are aligned.

The third level, the processors, defines multidimensional grids of abstract
processors onto which the templates are distributed.

Fig. 1. Hierarchical HPF programming model

It is the compiler's responsibility (possibly helped by compiler-specific directives)
to decide which physical computation node will correspond to a given processors
item.

This construction ensures a progressive refinement of the data mapping onto
the physical processors. In this way the programmer is able to group in the same
template all the arrays that interact. This avoids the multiple levels that
array-to-array alignments would introduce.

This three-level hierarchy is the most complete use of HPF directives. HPF
directives such as alignment between arrays, or distribution of arrays directly onto
processors, can bypass the template definition.

All of these directives are supported by HPF-Builder.

3  Graphical  interfaces  and  HPF 

To replace paper and pencils, a graphical editor has to provide several features:

-  a display of the source code and/or a summary of its syntactic architecture
(modules, subroutines, array declarations ...),

-  a global view of the hierarchical HPF construction,

-  a general visualization of each directive,

-  a detailed visualization, with the possibility of tracing the mapping of each
item of the objects,

-  WYSIWYG editing of HPF mapping directives,

-  a graphical tool to visualize and modify existing directives,

-  the automatic generation of the HPF directives,

-  the interpretation of directives to help the programmer in evaluating the
quality of mappings (array load balancing on virtual processors, evaluation
of the redistribution and realignment cost in terms of communications ...).

A few visual tools already exist to help HPF programmers. Some of them
are limited to visualization and do not help with directive editing.

This is the case of Annai/DDV [4], developed at CSSE/NEC, which visualizes
distributed data. It is integrated into a debugger, which means the code has to be
executed. Its goal is more to look at data values than at their mappings.

Often, such tools need to execute the code to actually process the data
mapping.

For example, DAQV [11] and Prism [12] trace communications at runtime
and generate accurate statistics, but the user has to execute heavy codes with
large amounts of data for each mapping to be tested.

We prefer to evaluate the mapping during the editing phase.

Another limitation we want to avoid is being dedicated to a particular compiler,
as GDDT [8] is within the Vienna Fortran environment. It is well suited to
visualizing mappings onto physical processors and to generating real communication
statistics, but it limits the user to a particular kind of target.

Lastly, our goal is to work only on the mapping, and not on code production.
We do not seek a complete visual programming solution like Help-Draw [1],
where the user programs everything from scratch to automatically obtain an HPF
code.


4  HPF-Builder 


HPF-Builder is built according to the HPF programming model. For each level
of the hierarchical representation of the data mapping, a graphical editor is defined
to visualize and modify the corresponding HPF directives in a WYSIWYG way. This
provides a step by step transformation from a Fortran 90 code towards an
HPF version. The data parallel algorithm expressed in Fortran 90 is never
modified. The HPF transformations concern exclusively the data mapping.

For each level, we present the corresponding editor with its main
specifications.


4.1  Example  program 

This matrix/vector product example is used throughout this paper to describe the
step by step transformation:

  integer :: NCol, NLine
  parameter (NCol=20)

  subroutine MV(M,V,R, NLine)
  integer :: NLine
  real, dimension(NLine,NCol), intent(in) :: M
  real, dimension(NCol), intent(in) :: V
  real, dimension(NLine), intent(out) :: R

  R(1:NLine) = 0.0
  do k = 1,NCol
    forall(i = 1:NLine)
      R(i) = R(i) + V(k)*M(i,k)
    end forall
  end do

  end subroutine

To generate an efficient V(k)*M(i,k) product, we must align together the
parts of M and V that interact. That is, each item V(k) must be aligned with
M(i,k) for every i. So the items of V must be replicated along the columns of M.

In the same way, the sum requires R to be replicated along the lines of M.

The processor grid used is a 2D mesh. The matrix M is arbitrarily distributed
(Cyclic, Block) on this grid. Implicit communications will be produced by the
compiler to update the R values replicated along the second dimension.

Finally,  we  obtain  this  HPF  code: 

  subroutine MV(M,V,R, NLine)
  integer :: NLine
  real, dimension(NLine,NCol), intent(in) :: M
  real, dimension(NCol), intent(in) :: V
  real, dimension(NLine), intent(out) :: R
  !HPF$ PROCESSORS MyProc(NUMBER_OF_PROCESSORS(1), &
  !HPF$                   NUMBER_OF_PROCESSORS(2))
  !HPF$ DISTRIBUTE M(Cyclic, Block) ONTO MyProc
  !HPF$ ALIGN V(:) WITH M(*,:)
  !HPF$ ALIGN R(:) WITH M(:,*)

  R(1:NLine) = 0.0
  do k = 1,NCol
    forall(i = 1:NLine)
      R(i) = R(i) + V(k)*M(i,k)
    end forall
  end do

  end subroutine

4.2  Source editing and parsing

The first phase of HPF-Builder concerns the analysis of the source file. A
modified version of the Cocktail HPF parser [9] is used. It supports Fortran 90
and almost all the data mapping directives of HPF. From both Fortran 90
and HPF code, HPF-Builder is able to build the syntactic tree of array and
HPF directive declarations and the hierarchical skeleton of the program.


Fig. 2. Source (a) and syntactic tree (b) at the beginning


At this step, HPF-Builder presents:

-  A full screen editor opened in the source window (figure 2(a)). The content of
this editor is updated with any interactive graphical manipulation. Underlined
pieces of text indicate selectable objects.

-  The syntactic tree summary, in the tree window (figure 2(b)).

-  Array and variable declarations, represented by icons in the skeleton window
(figure 3).

Fig. 3. Skeleton

The main window for visualization and editing is the skeleton. The other windows
add information about the syntactic structure of the program, and automatically
reflect any modification made by the user.

Clicking an entry anywhere selects it in the three windows. Several different
objects can be opened at a time, which lets the user see details about several
objects at once.

In the skeleton window, selection changes the icon into a subwindow (array M in
figure 3) which presents some details about the object: its name, rank and size,
and a wire representation in which directives will be displayed.

We can see in the subwindow of array M that an interprocedural analysis is
performed to find the value of the constant NCol. On the other hand, as NLine
has an unknown value at parsing time, a default value of 10 (marked by a “?”) is
taken (the user can specify other values to test different cases).


4.3  Processors and distributions

In the skeleton, clicking on a subwindow opens a menu from which new directives
can be created. A drag and drop onto another icon specifies the second operand of
an alignment or distribution. A creation menu allows new templates and processors
to be created.

HPF imposes some restrictions on alignments and distributions. For example,
an already distributed object cannot be realigned.

To prevent the user from creating such an invalid directive, the creation menu is
adapted to each object. For a distributed object, the “realign” entry is disabled.

Furthermore, HPF requires the size of a processors arrangement to match the
number of physical processors. The intrinsic NUMBER_OF_PROCESSORS returns this
number.

As HPF-Builder is not dedicated to a given target computer, a configuration
option defines this value. The parser is therefore able to evaluate this function
call as a constant value.

Thus, HPF-Builder lets the user graphically create the global structure of the
HPF skeleton, and verifies its coherency.

Once the two-dimensional processor mesh MyProc is created, a distribution
directive can be set between M and MyProc (figure 4).

In the editing window associated with this directive, a block, cyclic, or
collapsed distribution can be specified for each dimension of the distributee.

Then, in the wire representation of the processors, the projection of M is
drawn. It shows a cyclic distribution by an arrow ending in a small loop. A dashed
line is added for cyclic(k) and block(k) specifications.

Beside each distribution specification, a formula describes the distribution in
detail. In the example, the expression (2 × 3) + (2 × 2) specifies that the first 2
lines of processors each hold 3 lines of the template, and the last 2 each hold 2
lines. This describes a cyclic distribution which makes two and a half loops. On the
other dimension, the block distribution cuts the template into 4 blocks of
5 columns.

Fig. 4. Distribution specification

These two parts of the visualization let the user see a global draft of the
distribution and a more detailed view of the processor load.
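
The per-processor counts quoted for this example can be reproduced with a small sketch. It assumes a 4 × 4 processor grid (inferred from the formula and from the 4 blocks of 5 columns), the default NLine = 10 and NCol = 20. The helper names are hypothetical, and BLOCK is taken to use the usual HPF default block size, ceil(extent / processors).

```python
def cyclic_owner(index, n_procs):
    """Processor (1-based) owning a 1-based index under CYCLIC."""
    return (index - 1) % n_procs + 1


def block_owner(index, extent, n_procs):
    """Processor (1-based) owning a 1-based index under BLOCK."""
    block = -(-extent // n_procs)        # ceil(extent / n_procs)
    return (index - 1) // block + 1


def load(owner_of, extent, n_procs):
    """Number of indices held by each processor along one dimension."""
    counts = [0] * n_procs
    for index in range(1, extent + 1):
        counts[owner_of(index) - 1] += 1
    return counts


n_lines, n_cols, grid = 10, 20, 4        # assumed sizes for the example
print(load(lambda i: cyclic_owner(i, grid), n_lines, grid))        # [3, 3, 2, 2]
print(load(lambda k: block_owner(k, n_cols, grid), n_cols, grid))  # [5, 5, 5, 5]
```

The first list corresponds to the formula (2 × 3) + (2 × 2), and the second to the 4 blocks of 5 columns.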


4.4  Alignments 

As for the distribution, a drag and drop between V and M specifies an alignment
directive (see the creation menu in figure 5(a)).

Fig. 5. Alignment visualization and editing: (a) 2D, (b) 3D

In the same way, R will be replicated along the lines of M.

By default, direct alignment is chosen, after which selecting the alignment
icon allows its specifications to be changed. Here, we modify the direction along
which V must be aligned, and then we apply the replicate action in the other
direction.

These specifications are displayed in the alignment selection (central selection
of figure 5(a)).

Now, in the wire representation of the array M, the image of V is drawn. It
follows the columns of M and its replication is shown by a curve along the lines
(right selection in figure 5(a)).

Collapsing is shown by a double arrow, and stepped alignment by a dashed
line (figure 5(b)).

This wire representation lets the user see globally where the data are aligned.
Replications and steps appear clearly, following the geometrical aspect of HPF.

Visualization subwindows can be resized and zoomed in or out, so the size of
the objects is not a limit.

4.5  Detailed visualization

Once all of the directives are defined, the user wants to know whether the data are
really mapped as intended. For that, HPF-Builder uses a zoom effect to show exactly
which parts of the data are mapped onto a given processor.

The small compass under each editing subwindow lets the user move a cursor
along the objects. Its projection into and from its upper and lower subwindows
is drawn. Therefore, the user can see where a given item is projected and what
parts of other objects are projected onto it. Beside this compass, a label indicates
the exact coordinates of the cursor and of its projection. Thus, when data are very
large, the draft gives graphical information, and this label gives numeric values.

In figure 6, V(10) (the cube in the upper left selection) is mapped onto the
whole 10th column of M (bar in the center selection), itself distributed onto the
second column of MYPROC (the column of the right hand selection).

To see the processor load, the zoom effect can be used in the other direction:
we can see that P(1,2) (upper left cube in the right hand selection) contains lines
1 to 3 and columns 2, 6 ... 18 of M (bars in the central selection), items 2, 6, 10
... 18 of V, and items 1 to 3 of R.

So, when i = 1 and k = 10, the instruction R(1)=R(1)+V(10)*M(1,10) will
find all its operands on the same processor MYPROC(1,2).
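
The locality claim for R(1)=R(1)+V(10)*M(1,10) can be checked with a short sketch. It assumes the directives of section 4.1, namely M distributed (Cyclic, Block), V aligned with M(*,:) and R aligned with M(:,*), on an assumed 4 × 4 grid with NLine = 10 and NCol = 20. All function names are hypothetical.

```python
GRID = 4                      # assumed 4 x 4 processor grid MyProc
N_LINE, N_COL = 10, 20        # default NLine and NCol from the example


def owner_of_m(i, k):
    """MyProc coordinates holding M(i,k) under DISTRIBUTE M(Cyclic, Block)."""
    block = -(-N_COL // GRID)                 # ceil(N_COL / GRID)
    return ((i - 1) % GRID + 1, (k - 1) // block + 1)


def owners_of_v(k):
    """ALIGN V(:) WITH M(*,:): V(k) is replicated along column k of M."""
    return {owner_of_m(i, k) for i in range(1, N_LINE + 1)}


def owners_of_r(i):
    """ALIGN R(:) WITH M(:,*): R(i) is replicated along line i of M."""
    return {owner_of_m(i, k) for k in range(1, N_COL + 1)}


def is_local(i, k):
    """True if R(i), V(k) and M(i,k) are all present on the owner of M(i,k)."""
    p = owner_of_m(i, k)
    return p in owners_of_v(k) and p in owners_of_r(i)


print(owner_of_m(1, 10))      # (1, 2)
print(is_local(1, 10))        # True
```

Because both alignments replicate along the dimension they do not follow, every M(i,k) finds its V(k) and R(i) locally. The remaining communication is the update of the replicated copies of R.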


Fig. 6. Detailed visualization


Now, the user can change the distribution specifications. HPF-Builder
automatically updates the whole visual representation of this code. The programmer
can conclude that the distribution does not change the locality of the interacting
items of M, V and R.

After that, other experiments using realignment directives could produce
fewer implicit communications due to replications.

This example shows how, even without execution and without knowing all the
variable values and physical distributions, HPF-Builder can help the user in
choosing a better data mapping a priori.

5  Communication  visual  predictions 

Once  the  programmer  created  its  mapping,  efficiency  have  to  be  demonstrated. 
This  is  achieved  by  using  tools  to  visualize  data  distribution  load  and  commu¬ 
nication  cost  predictions. 

The  following  instruction,  with  cyclic  or  block  distribution,  is  taken  as  an 
example: 

forall(i=2:size(A, 1),  j=2:size(a,2)) 

A(i,j)=  B(i)+A(i-1, j-1) 

These predictions fall into several categories:

- The amount of data stored on each virtual processor. To show the
efficiency of a data distribution, a simple histogram with one bar per processor
is used (figure 7). Clicking one of these bars displays a list with details
of the data stored on that processor (as in the zoom effect described in 4.5).


Fig. 7. Load histogram example (panels: cyclic distribution, block distribution)


- For a given instruction, the number of operations needed on each virtual
processor. This is equivalent to the number of LHS data on each processor
(assuming the owner-computes rule). Thus, the visualization is the same.

- The number of data movements implied by an instruction, for a given
processor, as input (respectively output). From the instruction to be executed
on a given processor, we can deduce which data have to be read (written).
Then, we can obtain histograms showing where these data come from (go
to).

Figure 8 shows data movements around processor (3,3). Movements come
from white blocks and go to black ones. In the example, the processor reads
data from (2,2) and (3,4) and sends other data to (4,4).


Fig. 8. Data movements from and to one processor


- The total number of movements implied by an instruction. This means
combining the previous results for all the processors in one graph (figure 9).
Here again, the user can click a bar to see details about data origins and
destinations.


Fig. 9. Global data movements (panels: cyclic distribution, block distribution)
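
The load and data-movement counts listed above can be sketched by brute-force enumeration. The following Python fragment is only an illustration, not HPF-Builder's implementation: the helper names (`block_owner`, `cyclic_owner`, `predict`) are hypothetical, processors are numbered from 0, the same distribution is assumed on both dimensions of A, and the reads of B(i) are ignored because the alignment of B is not specified here.

```python
from collections import Counter

def block_owner(i, n, p):
    """Processor (0-based) owning 1-based index i of n items under BLOCK."""
    block = -(-n // p)              # ceil(n / p), the usual HPF block size
    return (i - 1) // block

def cyclic_owner(i, n, p):
    """Processor (0-based) owning 1-based index i under CYCLIC."""
    return (i - 1) % p

def predict(n, m, pr, pc, owner):
    """Enumerate forall(i=2:n, j=2:m) A(i,j) = ... A(i-1,j-1) on a pr x pc grid.

    Returns (load, moves): load[p] counts the LHS elements computed by
    processor p (owner-computes rule); moves[(src, dst)] counts the elements
    of A that processor dst must fetch from processor src.
    """
    load, moves = Counter(), Counter()
    for i in range(2, n + 1):
        for j in range(2, m + 1):
            dst = (owner(i, n, pr), owner(j, m, pc))          # owner of A(i,j)
            src = (owner(i - 1, n, pr), owner(j - 1, m, pc))  # owner of A(i-1,j-1)
            load[dst] += 1
            if src != dst:
                moves[(src, dst)] += 1
    return load, moves
```

For an 8 x 8 array on a 2 x 2 processor grid, `predict(8, 8, 2, 2, block_owner)` reports 13 remote reads in total, whereas `predict(8, 8, 2, 2, cyclic_owner)` reports 49, one for every computed element. The two `Counter` objects hold exactly the data that the per-processor histograms would plot.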


The same visualization tool can be extended to a loop nest or a block of
instructions. The computation can be iterated for each instruction of a block,
and then for each loop of the nest. The CPU time could become huge,
depending on the number of abstract processors.

Implementation

All the information needed to evaluate the distribution load is the same as
that needed to visualize the mappings accurately. During the visualization,
the user can specify default values for variables and runtime data. Thus, the
load distribution can be calculated at visualization time, without executing or
even compiling the program.

Evaluation of communication costs uses the same methods as HPF compilers:
for each processor, we have to identify which data have to be sent to
(received from) every other processor.

A solution currently studied in [2] consists in identifying how communications
are computed in the code generated by a compiler such as Adaptor.

For large data or large grids of processors, the calculation time could become
huge (too huge for interactive evaluation).

Assuming the owner-computes rule, any given instruction implies a data
movement for each remote access. Their number can be interpreted as a count
of the common points between two sets of positions. These calculations need
not enumerate all data movements. It is possible to evaluate them with symbolic
methods, to obtain formulas depending on variables and runtime data. When the
user sets these values, the data movements can be visualized without having
to compute everything from scratch. Furthermore, this method is independent
of the size of the data. First results were obtained in [3] for a global
communication cost evaluation.
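
To illustrate how a formula can replace enumeration, the sketch below counts the remote reads of A(i-1,j-1) for the forall example under a (BLOCK, BLOCK) distribution in two ways: by visiting every element and by a closed form. This is a hand-derived example for one instruction, not the symbolic method of [3]; the function names are hypothetical and the block size is assumed to be ceil(n/p).

```python
def remote_reads_enumerated(n, m, pr, pc):
    """Count remote reads of A(i-1,j-1) by visiting every (i, j)."""
    bi, bj = -(-n // pr), -(-m // pc)          # block sizes ceil(n/pr), ceil(m/pc)
    count = 0
    for i in range(2, n + 1):
        for j in range(2, m + 1):
            same_row_block = (i - 1) // bi == (i - 2) // bi
            same_col_block = (j - 1) // bj == (j - 2) // bj
            if not (same_row_block and same_col_block):
                count += 1
    return count

def remote_reads_formula(n, m, pr, pc):
    """Same count from a closed form whose cost does not depend on n or m."""
    bi, bj = -(-n // pr), -(-m // pc)
    ki = (n - 1) // bi     # rows i whose predecessor i-1 lies in another block
    kj = (m - 1) // bj     # columns j whose predecessor j-1 lies in another block
    # Inclusion-exclusion over "row crosses a block boundary" and
    # "column crosses a block boundary".
    return ki * (m - 1) + kj * (n - 1) - ki * kj
```

Both functions return 13 for an 8 x 8 array on a 2 x 2 grid, but only the first one slows down as the array grows, which is the property that makes interactive evaluation feasible.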


6  Conclusion 

Data parallel programming is still a difficult art. Scientific programmers have
expended a lot of effort in learning vector programming. Now they have to
learn a third-generation dialect of Fortran to map their data onto distributed
memory machines. To succeed in this task, they need tools to help them
manage their data distributions. HPF-Builder is a first step in Computer
Assisted High Performance Programming. The automatic insertion of HPF
directives in a Fortran 90 code frees the programmer from the new syntactic
constraints.

Optimization of both load distribution and communication overhead is a key
element of parallel programming. The extension of HPF-Builder presented in
this paper gives more information to guide the programmer during this
development phase in HPF.

Visualization of distribution and prediction of communication costs lead the
user to refine his HPF directives during the editing phase.

HPF-Builder is a good platform into which such tools can be plugged.

The user still decides whether one solution is better than another. Future work
should include more complex evaluation methods to guide the user to better
mappings.

More target-specific information, such as computation/communication overlapping,
network capabilities and cache effects, may be taken into account in these
methods.

The latest version of HPF-Builder is always available on the Web[10]. Everyone
is welcome to use it and to send comments on how to improve the functionality
of this tool.

Abstract. The natural world is certainly not organised through a central
thread of control. Things happen as the result of the actions and
interactions of unimaginably large numbers of independent agents, operating
at all levels of scale from nuclear to astronomic. Computer systems
aiming to be of real use in this real world need to model, at the appropriate
level of abstraction, that part of it for which they are to be of service.
If that modelling can reflect the natural concurrency in the system, it
ought to be much simpler.

Yet, traditionally, concurrent programming is considered to be an advanced
and difficult topic - certainly much harder than serial computing
which, therefore, needs to be mastered first. But this tradition is wrong.

This talk presents an intuitive, sound and practical model of parallel
computing that can be mastered by undergraduate students in the first
year of a computing (major) degree. It is based upon Hoare’s mathematical
theory of Communicating Sequential Processes (CSP), but does
not require mathematical maturity from the students - that maturity is
pre-engineered in the model. Fluency can be quickly developed in both
message-passing and shared-memory concurrency, whilst learning to cope
with key issues such as race hazards, deadlock, livelock, process starvation
and the efficient use of resources. Practical work can be hosted on
commodity PCs or UNIX workstations using either Java or the occam
multiprocessing language. Armed with this maturity, students are
well-prepared for coping with real problems on real parallel architectures that
have, possibly, less robust mathematical foundations.


1  Introduction 

At Kent, we have been teaching parallel computing at the undergraduate level
for the past ten years. Originally, this was presented to first-year students before
they became too set in the ways of serial logic. When this course was expanded
into a full unit (about 30 hours of teaching), timetable pressure moved it into
the second year. Either way, the material is easy to absorb and, after only a
few (around 5) hours of teaching, students have no difficulty in grappling with
the interactions of 25 (say) threads of control, appreciating and eliminating race
hazards and deadlock.

Parallel computing is still an immature discipline with many conflicting
cultures. Our approach to educating people into successful exploitation of parallel
mechanisms is based upon focusing on parallelism as a powerful tool for
simplifying the description of systems, rather than simply as a means for improving
their performance. We never start with an existing serial algorithm and say:
‘OK, let’s parallelise that!’. And we work solely with a model of concurrency
that has a semantics that is compositional - a fancy word for WYSIWYG - since,
without that property, combinatorial explosions of complexity always get us as
soon as we step away from simple examples. In our view, this rules out low-level
concurrency mechanisms, such as spin-locks, mutexes and semaphores, as well
as some of the higher-level ones (like monitors).

Communicating Sequential Processes (CSP)[1-3] is a mathematical theory for
specifying and verifying complex patterns of behaviour arising from interactions
between concurrent objects. Developed by Tony Hoare in the light of earlier
work on monitors, CSP has a compositional semantics that greatly simplifies
the design and engineering of such systems - so much so, that parallel design
often becomes easier to manage than its serial counterpart. CSP primitives have
also proven to be extremely lightweight, with overheads in the order of a few
hundred nanoseconds for channel synchronisation (including context-switch) on
current microprocessors [4,5].

Recently, the CSP model has been introduced into the Java programming
language [6-10]. Implemented as a library of packages [11,12], JavaPP[10]
enables multithreaded systems to be designed, implemented and reasoned about
entirely in terms of CSP synchronisation primitives (channels, events, etc.) and
constructors (parallel, choice, etc.). This allows 20 years of theory, design
patterns (with formally proven good properties - such as the absence of race hazards,
deadlock, livelock and thread starvation), tools supporting those design patterns,
education and experience to be deployed in support of Java-based multithreaded
applications.

2  Processes,  Channels  and  Message  Passing 

This  section  describes  a  simple  and  structured  multiprocessing  model  derived 
from  CSP.  It  is  easy  to  teach  and  can  describe  arbitrarily  complex  systems.  No 
formal  mathematics  need  be  presented  -  we  rely  on  an  intuitive  understanding 
of  how  the  world  works. 


2.1  Processes 

A process is a component that encapsulates some data structures and algorithms
for manipulating that data. Both its data and algorithms are private. The outside
world can neither see that data nor execute those algorithms. Each process is
alive, executing its own algorithms on its own data. Because those algorithms are
executed by the component in its own thread (or threads) of control, they express
the behaviour of the component from its own point of view^1. This considerably
simplifies that expression.

A  sequential  process  is  simply  a  process  whose  algorithms  execute  in  a  single 
thread  of  control.  A  network  is  a  collection  of  processes  (and  is,  itself,  a  process). 
Note  that  recursive  hierarchies  of  structure  are  part  of  this  model:  a  network  is 
a  collection  of  processes,  each  of  which  may  be  a  sub-network  or  a  sequential 
process. 

But  how  do  the  processes  within  a  network  interact  to  achieve  the  behaviour 
required  from  the  network?  They  can’t  see  each  other’s  data  nor  execute  each 
other’s  algorithms  -  at  least,  not  if  they  abide  by  the  rules. 

2.2  Synchronising  Channels 

The  simplest  form  of  interaction  is  synchronised  message-passing  along  channels. 
The  simplest  form  of  channel  is  zero-buffered  and  point-to-point.  Such  channels 
correspond  very  closely  to  our  intuitive  understanding  of  a  wire  connecting  two 
(hardware)  components. 


Fig. 1. A simple network (channel c connects process A to process B)

In  Figure  1,  A  and  B  are  processes  and  c  is  a  channel  connecting  them.  A  wire 
has  no  capacity  to  hold  data  and  is  only  a  medium  for  transmission.  To  avoid 
undetected  loss  of  data,  channel  communication  is  synchronised.  This  means 
that  if  A  transmits  before  B  is  ready  to  receive,  then  A  will  block.  Similarly,  if 
B  tries  to  receive  before  A  transmits,  B  will  block.  When  both  are  ready,  a  data 
packet  is  transferred  -  directly  from  the  state  space  of  A  into  the  state  space  of 
B.  We  have  a  synchronised  distributed  assignment. 
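
The blocking behaviour just described can be imitated with ordinary threads. The following Python sketch is an assumption-laden illustration, not the occam or JCSP channel implementation: the `Channel` class and the `demo` function are hypothetical names, and the channel supports exactly one writer and one reader.

```python
import threading
import time

class Channel:
    """Zero-buffered, point-to-point channel: one writer, one reader."""

    def __init__(self):
        self._item = None
        self._offered = threading.Semaphore(0)   # writer has placed an item
        self._taken = threading.Semaphore(0)     # reader has copied it out

    def write(self, value):
        self._item = value
        self._offered.release()
        self._taken.acquire()        # block until the reader has the value

    def read(self):
        self._offered.acquire()      # block until a writer is committed
        value = self._item
        self._taken.release()
        return value

def demo():
    """Process A transmits before process B is ready to receive."""
    c = Channel()
    log = []

    def process_a():
        c.write(42)
        log.append("A: write completed")

    a = threading.Thread(target=process_a)
    a.start()
    time.sleep(0.2)                  # B is busy elsewhere
    blocked = a.is_alive() and not log   # A must still be waiting in write()
    received = c.read()              # B finally receives
    a.join()
    return blocked, received, log
```

Calling `demo()` shows the writer held inside `write` until the reader arrives, after which the value passes directly from one thread to the other without ever being queued.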

2.3  Legoland 

Much can be done just with this simple model - from the design of self-timed
digital logic (no global clock) through to the wide range of industrial
multiprocessor embedded control for which occam[13-16] was originally designed.

Here are some simple examples to build up fluency. First we introduce some
elementary components from our ‘teaching’ catalogue - see Figure 2. All
processes are cyclic and all transmit and receive just numbers. The Id process
cycles through waiting for a number to arrive and, then, sending it on. Although
inserting an Id process in a wire will clearly not affect the data flowing through
it, it does make a difference. A bare wire has no buffering capacity. A wire
containing an Id process gives us a one-place FIFO. Connect 20 in series and we
get a 20-place FIFO - sophisticated function from a trivial design.

^1 This is in contrast with simple ‘objects’ and their ‘methods’. A method body
normally executes in the thread of control of the invoking object. Consequently,
object behaviour is expressed from the point of view of its environment rather
than the object itself. This is a slightly confusing property of traditional
‘object-oriented’ programming.


Fig. 2. Extract from a component catalogue (panels include Prefix (n, in, out)
and Tail (in, out))



Succ  is  like  Id,  but  increments  each  number  as  it  flows  through.  The  Plus 
component  waits  until  a  number  arrives  on  each  input  line  (accepting  their  arrival 
in  either  order)  and  outputs  their  sum.  Delta  waits  for  a  number  to  arrive  and, 
then,  broadcasts  it  in  parallel  on  its  two  output  lines  -  both  those  outputs  must 
complete  (in  either  order)  before  it  cycles  round  to  accept  further  input.  Prefix 
first  outputs  the  number  stamped  on  it  and  then  behaves  like  Id.  Tail  swallows 
its  first  input  without  passing  it  on  and  then,  also,  behaves  like  Id.  Prefix 
and  Tail  are  so  named  because  they  perform,  respectively,  prefixing  and  tail 
operations  on  the  streams  of  data  flowing  through  them. 

It’s essential to provide a practical environment in which students can develop
executable versions of these components and play with them (by plugging them
together and seeing what happens). This is easy to do in Occam and now, with
the JCSP library[11], in Java. Appendices A and B give some of the details. Here
we only give some CSP pseudo-code for our catalogue (because that’s shorter
than the real code):

Id (in, out) = in ? x --> out ! x --> Id (in, out)

Succ (in, out) = in ? x --> out ! (x+1) --> Succ (in, out)

Plus (in0, in1, out)
  = ((in0 ? x0 --> SKIP) || (in1 ? x1 --> SKIP));
    out ! (x0 + x1) --> Plus (in0, in1, out)

Delta (in, out0, out1)
  = in ? x --> ((out0 ! x --> SKIP) || (out1 ! x --> SKIP));
    Delta (in, out0, out1)

Prefix (n, in, out) = out ! n --> Id (in, out)

Tail (in, out) = in ? x --> Id (in, out)

[Notes: ‘free’ variables used in these pseudo-codes are assumed to be locally
declared and hidden from outside view. All these components are sequential
processes. The process (in ? x --> P (...)) means: “wait until you can engage
in the input event (in ? x) and, then, become the process P (...)”. The input
operator (?) and output operator (!) bind more tightly than the -->.]
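
For readers without occam or JCSP to hand, the catalogue can be approximated with Python threads. This is a minimal sketch under stated assumptions, not the code of Appendices A and B: `Channel`, `par` and `spawn` are hypothetical helpers, each cyclic process runs forever in a daemon thread, and every channel has exactly one writer and one reader.

```python
import threading

class Channel:
    """Zero-buffered, point-to-point channel (one writer, one reader)."""
    def __init__(self):
        self._item = None
        self._offered = threading.Semaphore(0)
        self._taken = threading.Semaphore(0)

    def write(self, value):
        self._item = value
        self._offered.release()
        self._taken.acquire()          # wait until the reader has taken it

    def read(self):
        self._offered.acquire()        # wait until a writer is committed
        value = self._item
        self._taken.release()
        return value

def par(*actions):
    """The || constructor: run the actions in parallel and wait for all."""
    threads = [threading.Thread(target=a, daemon=True) for a in actions]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def Id(inp, out):
    while True:
        out.write(inp.read())

def Succ(inp, out):
    while True:
        out.write(inp.read() + 1)

def Plus(in0, in1, out):
    while True:
        x = [None, None]
        def get0(): x[0] = in0.read()
        def get1(): x[1] = in1.read()
        par(get0, get1)                # inputs are accepted in either order
        out.write(x[0] + x[1])

def Delta(inp, out0, out1):
    while True:
        x = inp.read()
        par(lambda: out0.write(x), lambda: out1.write(x))

def Prefix(n, inp, out):
    out.write(n)
    Id(inp, out)

def Tail(inp, out):
    inp.read()                         # swallow the first input
    Id(inp, out)

def spawn(process, *args):
    """Start a cyclic process in its own (daemon) thread of control."""
    threading.Thread(target=process, args=args, daemon=True).start()
```

As a usage example, `spawn(Id, i, c); spawn(Id, c, o)` on three fresh channels accepts two writes to `i` before anything is read from `o`, which is the FIFO behaviour obtained by chaining Id processes.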

2.4  Plug  and  Play 

Plugging these components together and reasoning about the resulting behaviour
is easy. Thanks to the rules on process privacy^2, race hazards leading to
unpredictable internal state do not arise. Thanks to the rules on channel
synchronisation, data loss or corruption during communication cannot occur^3.
What makes the reasoning simple is that the parallel constructor and channel
primitives are deterministic. Non-determinism has to be explicitly designed into
a process and coded - it can’t sneak in by accident!

Figure 3 shows a simple example of reasoning about network composition.
Connect a Prefix and a Tail and we get two Ids:

(Prefix (in, c) || Tail (c, out)) = (Id (in, c) || Id (c, out))

Equivalence means that no environment (i.e. external network in which they
are placed) can tell them apart. In this case, both circuit fragments implement a
2-place FIFO. The only place where anything different happens is on the internal
wire and that’s undetectable from outside. The formal proof is a one-liner from
the definition of the parallel (||), communications (!, ?) and and-then-becomes
(-->) operators in CSP. But the good thing about CSP is that the mathematics
engineered into its design and semantics cleanly reflects an intuitive human feel
for the model. We can see the equivalence at a glance and this quickly builds
confidence both for us and our students.

^2 No external access to internal data. No external execution of internal algorithms
(methods).

^3 Unreliable communications over a distributed network can be accommodated in this
model - the unreliable network being another active process (or set of processes)
that happens not to guarantee to pass things through correctly.


Fig. 4. Some more interesting circuits (Numbers (out), Integrate (in, out),
Pairs (in, out))

Figure 4 shows some more interesting circuits with the first two incorporating
feedback. What do they do? Ask the students! Here are some CSP pseudo-codes
for these circuits:

Numbers (out)
  = Prefix (0, c, a) || Delta (a, out, b) || Succ (b, c)

Integrate (in, out)
  = Plus (in, c, a) || Delta (a, out, b) || Prefix (0, b, c)

Pairs (in, out)
  = Delta (in, a, b) || Tail (b, c) || Plus (a, c, out)

Again, our rule for these pseudo-codes means that a, b and c are locally
declared channels (hidden, in the CSP sense, from the outside world). Appendices
A and B list Occam and Java executables - notice how closely they reflect the
CSP.

Back to what these circuits do: Numbers generates the sequence of natural
numbers, Integrate computes running sums of its inputs and Pairs outputs
the sum of its last two inputs. If we wish to be more formal, let c<i> represent
the i’th element that passes through channel c - i.e. the first element through
is c<1>. Then, for any i >= 1:

Numbers:    out<i> = i - 1

Integrate:  out<i> = Sum {in<j> | j = 1..i}

Pairs:      out<i> = in<i> + in<i + 1>

Be careful that the above only details part of the specification of these circuits:
how the values in their output stream(s) relate to the values in their input
stream(s). We also have to be aware of how flexible they are in synchronising
with their environments, as they generate and consume those streams. The base
level components Id, Succ, Plus and Delta each demand one input (or pair of
inputs) before generating one output (or pair of outputs). Tail demands two
inputs before its first output, but thereafter gives one output for each input.
This effect carries over into Pairs. Integrate adds 2-place buffering between
its input and output channels (ignoring the transformation in the actual values
passed). Numbers will always deliver to anything trying to take input from it.
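
The value part of these specifications can be checked mechanically. The sketch below models each circuit as a lazy Python stream rather than as a network of processes, so it captures only the stream equations and says nothing about synchronisation or buffering; the function names are hypothetical, with `tee` standing in for Delta, the discarded first element for Tail, and the element-wise sum for Plus.

```python
from itertools import accumulate, count, islice, tee

def numbers():
    """out<i> = i - 1"""
    return count(0)

def integrate(stream):
    """out<i> = Sum {in<j> | j = 1..i}"""
    return accumulate(stream)

def pairs(stream):
    """out<i> = in<i> + in<i + 1>"""
    a, b = tee(stream)               # Delta: two copies of the input stream
    next(b, None)                    # Tail: drop the first element of one copy
    return (x + y for x, y in zip(a, b))   # Plus: add the copies pairwise

def take(n, stream):
    """First n elements of a (possibly infinite) stream, as a list."""
    return list(islice(stream, n))
```

For instance, `take(5, integrate(numbers()))` yields `[0, 1, 3, 6, 10]` and `take(4, pairs(numbers()))` yields `[1, 3, 5, 7]`, in agreement with the formulas.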

If  necessary,  we  can  make  these  synchronisation  properties  mathematically 
precise.  That  is,  after  all,  one  of  the  reasons  for  which  CSP  was  designed. 

2.5  Deadlock  -  First  Contact 

Consider  the  circuit  in  Figure  5.  A  simple  stream  analysis  would  indicate  that: 


Pairs2:  a<i>   = in<i>
Pairs2:  b<i>   = in<i>
Pairs2:  c<i>   = b<i + 1> = in<i + 1>
Pairs2:  d<i>   = c<i + 1> = in<i + 2>
Pairs2:  out<i> = a<i> + d<i> = in<i> + in<i + 2>


Fig. 5. A dangerous circuit: Pairs2 (in, out)



But this analysis only shows what would be generated if anything were
generated. In this case, nothing is generated since the system deadlocks. The two
Tail processes demand three items from Delta before delivering anything to
Plus. But Delta can’t deliver a third item to the Tails until it’s got rid of its
second item to Plus. But Plus won’t accept a second item from Delta until it’s
had its first item from the Tails. Deadlock!

In this case, deadlock can be designed out by inserting an Id process on
the upper (a) channel. Id processes (and FIFOs in general) have no impact on
stream contents analysis but, by allowing a more decoupled synchronisation, can
impact on whether streams actually flow. Beware, though, that adding buffering
to channels is not a general cure for deadlock.
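
The deadlock and its cure can be reproduced with threads. The sketch below is an illustration under assumptions, not the paper's occam or Java code: the wiring Delta (in, a, b) || Tail (b, c) || Tail (c, d) || Plus (a, d, out) is inferred from the stream equations for Pairs2, all helper names are hypothetical, and deadlock is merely inferred from a read that fails to complete within a timeout.

```python
import threading

class Channel:
    """Zero-buffered, point-to-point channel (one writer, one reader)."""
    def __init__(self):
        self._item = None
        self._offered = threading.Semaphore(0)
        self._taken = threading.Semaphore(0)
    def write(self, value):
        self._item = value
        self._offered.release()
        self._taken.acquire()
    def read(self):
        self._offered.acquire()
        value = self._item
        self._taken.release()
        return value

def par(*actions):
    threads = [threading.Thread(target=a, daemon=True) for a in actions]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def Id(inp, out):
    while True:
        out.write(inp.read())

def Tail(inp, out):
    inp.read()
    Id(inp, out)

def Plus(in0, in1, out):
    while True:
        x = [None, None]
        def get0(): x[0] = in0.read()
        def get1(): x[1] = in1.read()
        par(get0, get1)
        out.write(x[0] + x[1])

def Delta(inp, out0, out1):
    while True:
        x = inp.read()
        par(lambda: out0.write(x), lambda: out1.write(x))

def feeder(out):
    n = 0                              # supplies in<i> = i - 1: 0, 1, 2, ...
    while True:
        out.write(n)
        n += 1

def spawn(process, *args):
    threading.Thread(target=process, args=args, daemon=True).start()

def pairs2(inp, out, buffer_a=False):
    a, b, c, d = Channel(), Channel(), Channel(), Channel()
    if buffer_a:                       # the cure: an Id process on channel a
        a0 = Channel()
        spawn(Delta, inp, a0, b)
        spawn(Id, a0, a)
    else:
        spawn(Delta, inp, a, b)
    spawn(Tail, b, c)
    spawn(Tail, c, d)
    spawn(Plus, a, d, out)

def try_read(channel, timeout=0.5):
    """Next value from channel, or None if nothing arrives within timeout."""
    box = []
    t = threading.Thread(target=lambda: box.append(channel.read()), daemon=True)
    t.start()
    t.join(timeout)
    return box[0] if box else None

def run(buffer_a, n):
    inp, out = Channel(), Channel()
    spawn(feeder, inp)
    pairs2(inp, out, buffer_a)
    return [try_read(out) for _ in range(n)]
```

On an otherwise idle machine, `run(False, 1)` returns `[None]` because no output ever appears, whereas `run(True, 3)` returns `[2, 4, 6]`, matching out<i> = in<i> + in<i + 2> for the input stream 0, 1, 2, ...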

So, there are always two questions to answer: what data flows through the
channels, assuming data does flow, and are the circuits deadlock-free? Deadlock
is a monster that must - and can - be vanquished. In CSP, deadlock only occurs
from a cycle of committed attempts to communicate (input or output); each
process in the cycle refusing its predecessor’s call as it tries to contact its successor.
Deadlock potential is very visible - we even have a deadlock primitive (STOP) to
represent it, on the grounds that it is a good idea to know your enemy!

In practice, there now exists a wealth of design rules that provide formally
proven guarantees of deadlock freedom[17-22]. Design tools supporting these
rules - both constructive and analytical - have been researched[23,24]. Deadlock,
together with related problems such as livelock and starvation, need threaten us
no longer - even in the most complex of parallel systems.


2.6  Structured  Plug  and  Play 

Consider  the  circuits  of  Figure  6.  They  are  similar  to  the  previous  circuits, 
but  contain  components  other  than  those  from  our  base  catalogue  -  they  use 
components  we  have  just  constructed.  Here  is  the  CSP: 

Fibonacci (out)
  = Prefix (0, d, a) || Prefix (1, a, b) ||
    Delta (b, out, c) || Pairs (c, d)

Squares (out)
  = Numbers (a) || Integrate (a, b) || Pairs (b, out)

Demo (out)
  = Numbers (a) || Fibonacci (b) || Squares (c) ||
    Tabulates (a, b, c, out)


Fig. 6. Circuits of circuits (Fibonacci (out), Squares (out), Demo (out))



One  of  the  powers  of  CSP  is  that  its  semantics  obey  simple  composition  rules. 
To  understand  the  behaviour  implemented  by  a  network,  we  only  need  to  know 
the  behaviour  of  its  nodes  -  not  their  implementations. 

For example, Fibonacci is a feedback loop of four components. At this level,
we can remain happily ignorant of the fact that its Pairs node contains another
three. We only need to know that it requires two numbers before it outputs
anything and that, thereafter, it outputs once for every input. The two Prefixes
initially inject two numbers (0 and 1) into the circuit. Both go into Pairs,
but only one (their sum) emerges. After this, the feedback loop just contains a
single circulating packet of information (successive elements of the Fibonacci
sequence). The Delta process taps this circuit to provide external output.

Squares is a simple pipeline of three components. It’s best not to think of
the nine processes actually involved. Clearly, for i >= 1:

Squares: a<i> = i - 1

Squares: b<i> = Sum {j - 1 | j = 1..i} = Sum {j | j = 0..(i - 1)}

Squares: out<i> = Sum {j | j = 0..(i - 1)} + Sum {j | j = 0..i} = i * i

So, Squares outputs the increasing sequence of squared natural numbers. It
doesn’t deadlock because Integrate and Pairs only add buffering properties
and it’s safe to connect buffers in series.
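
The last equality in the stream analysis follows from the closed form of the triangular sums. The short derivation below spells it out; the subscript notation $b_i$ for b<i> is introduced here only for compactness.

```latex
\begin{align*}
b_i &= \sum_{j=0}^{i-1} j = \frac{i(i-1)}{2}, \\
\mathit{out}_i &= b_i + b_{i+1}
  = \frac{i(i-1)}{2} + \frac{i(i+1)}{2}
  = \frac{i\,\bigl((i-1) + (i+1)\bigr)}{2}
  = i^2 .
\end{align*}
```

So the first few outputs are 1, 4, 9, 16, one square per element drawn from Numbers after the first.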

Tabulates  is  from  our  base  catalogue.  Like  the  others,  it  is  cyclic.  In  each 
cycle,  it  inputs  in  parallel  one  number  from  each  of  its  three  input  channels  and, 
then,  generates  a  line  of  text  on  its  output  channel  consisting  of  a  tabulated 
(15-wide,  in  this  example)  decimal  representation  of  those  numbers. 

Tabulates (in0, in1, in2, out)
  = ((in0 ? x0 --> SKIP) || (in1 ? x1 --> SKIP) || (in2 ? x2 --> SKIP));
    print (x0, 15, out); print (x1, 15, out); println (x2, 15, out);
    Tabulates (in0, in1, in2, out)

Connecting  the  output  channel  from  Demo  to  a  text  window  displays  three 
columns  of  numbers:  the  natural  numbers,  the  Fibonacci  sequence  and  perfect 
squares. 

It’s  easy  to  understand  all  this  -  thanks  to  the  structuring.  In  fact,  Demo 
consists  of  27  threads  of  control,  19  of  them  permanent  with  the  other  8  being 
repeatedly  created  and  destroyed  by  the  low-level  parallel  inputs  and  outputs 
in  the  Delta,  Plus  and  Tabulates  components.  If  we  tried  to  understand  it  on 
those  terms,  however,  we  would  get  nowhere. 

Please note that we are not advocating designing at such a fine level of
granularity as normal practice! These are only exercises and demonstrations to build
up fluency and confidence in concurrent logic. Having said that, the process
management overheads for the Occam Demo executables are only around 30
microseconds per output line of text (i.e. too low to see) and three milliseconds
for the Java (still too low to see). And, of course, if we are using these
techniques for designing real hardware[25], we will be working at much finer levels
of granularity than this.


2.7  Coping with the Real World - Making Choices

The model we have considered so far - parallel processes communicating through
dedicated (point-to-point) channels - is deterministic. If we input the same data
in repeated runs, we will always receive the same results. This is true regardless
of how the processes are scheduled or distributed. This provides a very stable
base from which to explore the real world, which doesn’t always behave like this.

Any machine with externally operable controls that influence its internal
operation, but whose internal operations will continue to run in the absence of
that external control, is not deterministic in the above sense. The scheduling of
that external control will make a difference. Consider a car and its driver heading
for a brick wall. Depending on when the driver applies the brakes, they will end
up in very different states!

CSP provides operators for internal and external choice. An external choice
is when a process waits for its environment to engage in one of several events -
what happens next is something the environment can determine (e.g. a driver
can press the accelerator or brake pedal to make the car go faster or slower).
An internal choice is when a process changes state for reasons its environment
cannot determine (e.g. a self-clocked timeout or the car runs out of petrol). Note
that for the combined (parallel) system of car-and-driver, the accelerating and
braking become internal choices so far as the rest of the world is concerned.

Occam provides a constructor (ALT) that lets a process wait for one of many
events. These events are restricted to channel input, timeouts and SKIP (a null
event that has always happened). We can also set pre-conditions - run-time tests
on internal state - that mask whether a listed event should be included in any
particular execution of the ALT. This allows very flexible internal choice within a
component as to whether it is prepared to accept an external communication^4.
The JavaPP libraries provide an exact analogue (Alternative.select) for these
choice mechanisms.

If several events are pending at an ALT, an internal choice is normally made
between them. However, Occam allows a PRI ALT which resolves the choice
between pending events in order of their listing. This returns control of the
operation to the environment, since the reaction of the PRI ALTing process to multiple
communications is now predictable. This control is crucial for the provision of
real-time guarantees in multi-process systems and for the design of hardware.
Recently, extensions to CSP to provide a formal treatment of these mechanisms
have been made[26,27].


Fig. 7. Two control processes (Replace and Scale)


^4 This is in contrast to monitors, whose methods cannot refuse an external call when
they are unlocked and have to wait on condition variables should their state prevent
them from servicing the call. The close coupling necessary between sibling monitor
methods to undo the resulting mess is not WYSIWYG[9].

Figure 7 shows two simple components with this kind of control. Replace
listens for incoming data on its in and inject lines. Most of the time, data
arrives from in and is immediately copied to its out line. Occasionally, a signal
from the inject line occurs. When this happens, the signal is copied out but,
at the same time, the next input from in is waited for and discarded. In case
both inject and in communications are on offer, priority is given to the (less
frequently occurring) inject:


Replace (in, inject, out)
  = (inject ? signal --> ((in ? x --> SKIP) || (out ! signal --> SKIP))
     [PRI]
     in ? x --> out ! x --> SKIP
    );
    Replace (in, inject, out)


Replace  is  something  that  can  be  spliced  into  any  channel.  If  we  don’t  use 
the  inject  line,  all  it  does  is  add  a  one-place  buffer  to  the  circuit.  If  we  send 
something down the inject line, it gets injected into the circuit - replacing the
next  piece  of  data  that  would  have  travelled  through  that  channel. 
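As a rough illustration of the priority rule, the following Java sketch simulates Replace sequentially. It is not the JavaPP API: the class ReplaceSketch, its step method and the use of buffered java.util queues in place of synchronising channels are all assumptions made for illustration. Each call to step services at most one pending event and always looks at inject before in.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

// Hypothetical sequential model of Replace; queues stand in for CSP channels.
final class ReplaceSketch {
    // True after an injected signal has gone out and the next input is still to be discarded.
    private boolean discardPending = false;

    /** Services one pending event; returns false when nothing can be done. */
    boolean step(Queue<Integer> in, Queue<Integer> inject, List<Integer> out) {
        if (discardPending) {
            if (in.isEmpty()) return false;   // still waiting for the input to discard
            in.remove();                      // in ? x --> SKIP (value thrown away)
            discardPending = false;
            return true;
        }
        Integer signal = inject.poll();       // PRI: inject is examined first
        if (signal != null) {
            out.add(signal);                  // out ! signal
            discardPending = true;
            return true;
        }
        Integer x = in.poll();
        if (x != null) {
            out.add(x);                       // in ? x --> out ! x
            return true;
        }
        return false;
    }

    /** Copies 1, then injects 99, which replaces the 2 that would have come next. */
    static List<Integer> demo() {
        Queue<Integer> in = new ArrayDeque<>(Arrays.asList(1, 2, 3, 4));
        Queue<Integer> inject = new ArrayDeque<>();
        List<Integer> out = new ArrayList<>();
        ReplaceSketch replace = new ReplaceSketch();
        replace.step(in, inject, out);
        inject.add(99);
        while (replace.step(in, inject, out)) { /* run until idle */ }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this model the demo delivers 1, 99, 3, 4: the injected value takes the place of the 2. A real implementation would block on the channels instead of polling.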


[Figure: RNumbers (out, reset) and RIntegrate (in, out, reset)]

Fig. 8. Two controllable processes
Figure 8 shows RNumbers and RIntegrate, which are just Numbers and
Integrate with an added Replace component. We now have components that
are resettable by their environments. RNumbers can be reset at any time to
continue its output sequence from any chosen value. RIntegrate can have its
internal running sum redefined.

Like  Replace,  Scale  (figure  7)  normally  copies  numbers  straight  through, 
but scales them by its factor m. An inject signal resets the scale factor:

Scale (m, in, inject, out)
  = (inject ? m --> SKIP
     [PRI]
     in ? x --> out ! m*x --> SKIP
    );
    Scale (m, in, inject, out)

Figure  9  shows  RPairs,  which  is  Pairs  with  the  Scale  control  component 
added.  If  we  send  just  +1  or  -1  down  the  reset  line  of  RPairs,  we  control  whether 
it’s  adding  or  subtracting  successive  pairs  of  inputs.  When  it’s  subtracting,  its 
behaviour  changes  to  that  of  a  differentiator  -  in  the  sense  that  it  undoes  the 
effect  of  Integrate. 


RPairs  (in,  out,  reset) 

Fig.  9.  Sometimes  Pairs,  sometimes  Differentiate 


This  allows  a  nice  control  demonstration.  Figure  10  shows  a  circuit  whose 
core  is  a  resettable  version  of  the  Squares  pipeline.  The  Monitor  process  reacts 
to characters from the keyboard channel. Depending on the character received, it
outputs an appropriate signal down an appropriate reset channel:


Monitor (keyboard, resetN, resetI, resetP)
  = (keyboard ? ch -->
       CASE ch
         ‘N’: resetN ! 0 --> SKIP
         ‘I’: resetI ! 0 --> SKIP
         ‘+’: resetP ! +1 --> SKIP
         ‘-’: resetP ! -1 --> SKIP
    );
    Monitor (keyboard, resetN, resetI, resetP)
[Figure: Demo2 (keyboard, screen)]

Fig. 10. A user controllable machine

When  Demo2  runs  and  we  don’t  type  anything,  we  see  the  inner  workings  of 
the  Squares  pipeline  tabulated  in  three  columns  of  output.  Keying  in  an  ‘N’, 
‘I’, ‘+’ or ‘-’ character allows the user some control over those workings®. Note
that after a ‘-’, the output from RPairs should be the same as that taken from
RNumbers.

2.8 A Nastier Deadlock

One  last  exercise  should  be  done.  Modify  the  system  so  that  output  freezes  if  an 
‘F’  is  typed  and  unfreezes  following  the  next  character. 

Two  ‘solutions’  offer  themselves  and  Figure  11  shows  the  wrong  one  (Demo3). 
This  feeds  the  output  from  Tabulates  back  to  a  modified  Monitor2  and  then  on 
to  the  screen.  The  Monitor2  process  PRI  ALTs  between  the  keyboard  channel 
and  this  feedback: 

Monitor2 (keyboard, feedback, resetN, resetI, resetP)
  = (keyboard ? ch -->
       CASE ch
         ... deal with ‘N’, ‘I’, ‘+’, ‘-’ as before
         ‘F’: keyboard ? ch --> SKIP
     [PRI]
     feedback ? x --> screen ! x --> SKIP
    );
    Monitor2 (keyboard, feedback, resetN, resetI, resetP)

®  In  practice,  we  need  to  add  another  process  after  Tabulates  to  slow  down  the  rate  of 
output  to  around  10  lines  per  second.  Otherwise,  the  user  cannot  properly  appreciate 
the immediacy of control that has been obtained.
Fig. 11. A machine over which we may lose control


Traffic will normally be flowing along the feedback-screen route, interrupted
only when Monitor2 services the keyboard. The attraction is that if
an ‘F’ arrives, Monitor2 simply waits for the next character (and discards it).
As a side-effect of this waiting, the screen traffic is frozen.

But if we implement this, we get some worrying behaviour. The freeze operation
works fine and so, probably, do the ‘N’ and ‘I’ resets. Sometimes, however,
a ‘+’ or ‘-’ reset deadlocks the whole system - the screen freezes and all further
keyboard events are refused!

The  problem  is  that  one  of  the  rules  for  deadlock-free  design  has  been  broken: 
any data-flow circuit must control the number of packets circulating! If this number
rises to the number of sequential (i.e. lowest level) processes in the circuit,
deadlock  always  results.  Each  node  will  be  trying  to  output  to  its  successor  and 
refusing  input  from  its  predecessor. 
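The rule can be checked on a toy model. The sketch below is an assumption-laden illustration, not part of the original design: it treats the circuit as a simple ring in which each sequential process holds at most one packet and can only pass it on when its successor is empty. The names RingSketch, canProgress and firstJammingCount are hypothetical.

```java
// Toy model of a data-flow ring: holding[i] is true when process i holds a packet.
final class RingSketch {
    /** Some process can output iff it holds a packet and its successor is empty. */
    static boolean canProgress(boolean[] holding) {
        int n = holding.length;
        for (int i = 0; i < n; i++) {
            if (holding[i] && !holding[(i + 1) % n]) {
                return true;
            }
        }
        return false;
    }

    /** Smallest number of packets (at least one) at which no process in an n-ring can move. */
    static int firstJammingCount(int n) {
        for (int packets = 1; packets <= n; packets++) {
            boolean[] holding = new boolean[n];
            for (int i = 0; i < packets; i++) {
                holding[i] = true;            // place the packets in the first slots
            }
            if (!canProgress(holding)) {
                return packets;
            }
        }
        return -1;                            // not reached for n >= 1
    }

    public static void main(String[] args) {
        System.out.println(firstJammingCount(5));
    }
}
```

In this model the ring only jams when the packet count reaches the process count, which is the situation described above: every node is trying to output and none will accept input.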

The  Numbers,  RNumbers,  Integrate,  RIntegrate  and  Fibonacci  networks 
all  contain  data-flow  loops,  but  the  number  of  packets  concurrently  in  flight  is 
kept  at  one®. 

In  Demo3  however,  packets  are  continually  being  generated  within  RNumbers, 
flowing  through  several  paths  to  Monitor2  and,  then,  to  the  screen.  Whenever 
Monitor2  feeds  a  reset  back  into  the  circuit,  deadlock  is  possible  -  although  not 
certain.  It  depends  on  the  scheduling.  RNumbers  is  always  pressing  new  packets 
into  the  system,  so  the  circuits  are  likely  to  be  fairly  full.  If  Monitor2  generates 
a reset when they are full, the system deadlocks. The shortest feedback loop is
from Monitor2 through RPairs and Tabulates and back to Monitor2 - hence, it is
the ‘+’ and ‘-’ inputs from keyboard that are most likely to trigger the deadlock.

®  Initially,  Fibonacci  has  two  packets,  but  they  combine  into  one  before  the  end  of 
their first circuit.
Demo4 (keyboard, screen)

Fig.  12.  A  machine  over  which  we  will  not  lose  control 


The  design  is  simply  fixed  by  removing  that  feedback  at  this  level  -  see  Demo4 
in  Figure  12.  We  have  abstracted  the  freezing  operation  into  its  own  component 
(and  catalogued  it).  It’s  never  a  good  idea  to  try  and  do  too  many  functions  in 
one  sequential  process.  That  needlessly  constrains  the  synchronisation  freedom 
of  the  network  and  heightens  the  risk  of  deadlock.  Note  that  the  idea  being 
pushed  here  is  that,  unless  there  are  special  circumstances,  parallel  design  is 
safer  and  simpler  than  its  serial  counterpart! 

Demo4  obeys  another  golden  rule:  every  device  should  be  driven  from  its  own 
separate process. The keyboard and screen channels interface to separate devices
and should be operated concurrently (in Demo3, both were driven from one
sequential  process  -  Monitor2).  Here  are  the  driver  processes  from  Demo4: 

Freeze (in, freeze, out)
  = (freeze ? x --> freeze ? x --> SKIP
     [PRI]
     in ? x --> out ! x --> SKIP
    );
    Freeze (in, freeze, out)

Monitor3 (keyboard, resetN, resetI, resetP, freeze)
  = (keyboard ? ch -->
       CASE ch
         ... deal with ‘N’, ‘I’, ‘+’, ‘-’ as before
         ‘F’: freeze ! ch --> keyboard ? ch --> freeze ! ch --> SKIP
    );
    Monitor3 (keyboard, resetN, resetI, resetP, freeze)
A channel structure is just a record (or object) holding two or more CSP
channels.  Usually,  there  would  be  just  two  channels  -  one  for  each  direction  of 
communication.  The  channel  structure  is  used  to  conduct  a  two-way  conversation 
between  two  processes.  To  avoid  deadlock,  of  course,  they  will  have  to  understand 
protocols  for  using  the  channel  structure  -  such  as  who  speaks  first  and  when  the 
conversation  finishes.  We  call  the  process  that  opens  the  conversation  a  client 
and  the  process  that  listens  for  that  call  a  server^. 


Fig.  13.  A  many-many  shared  channel 


The  CSP  model  is  extended  by  allowing  multiple  clients  and  servers  to  share 
the same channel (or channel structure) - see Figure 13. Sanity is preserved
by  ensuring  that  only  one  client  and  one  server  use  the  shared  object  at  any 
one  time.  Clients  wishing  to  use  the  channel  queue  up  first  on  a  client-queue 
(associated  with  the  shared  channel)  -  servers  on  a  server-queue  (also  associated 
with  the  shared  channel).  A  client  only  completes  its  actions  on  the  shared 
channel  when  it  gets  to  the  front  of  its  queue,  finds  a  server  (for  which  it  may 
have  to  wait  if  business  is  good)  and  completes  its  transaction.  A  server  only 
completes  when  it  reaches  the  front  of  its  queue,  finds  a  client  (for  which  it  may 
have  to  wait  in  times  of  recession)  and  completes  its  transaction. 

Note that shared channels - like the choice operator between multiple events
-  introduce  scheduling  dependent  non-determinism.  The  order  in  which  processes 
are  granted  access  to  the  shared  channel  depends  on  the  order  in  which  they  join 
the  queues. 

Shared  channels  provide  a  very  efficient  mechanism  for  a  common  form  of 
choice.  Any  server  that  offers  a  non-discriminatory  service®  to  multiple  clients 
should  use  a  shared  channel,  rather  than  ALTing  between  individual  channels 
from  those  clients.  The  shared  channel  has  a  constant  time  overhead  -  ALTing 
is  linear  on  the  number  of  clients.  However,  if  the  server  needs  to  discriminate 
between  its  clients  (e.g.  to  refuse  service  to  some,  depending  upon  its  internal 
state), ALTing gives us that flexibility. The mechanisms can be efficiently combined.
Clients can be grouped into equal-treatment partitions, with each group
clustered  on  its  own  shared  channel  and  the  server  ALTing  between  them. 
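A minimal sketch of the many-to-one case, written in plain Java, may help to fix the idea. It does not use the JavaPP shared-channel classes: java.util.concurrent.SynchronousQueue in fair mode is swapped in as a stand-in, since it also makes waiting writers queue in FIFO order and hands over one item per rendezvous. The class name SharedChannelSketch and the integer requests are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.SynchronousQueue;

// Several clients share one synchronising channel to a single server loop.
final class SharedChannelSketch {
    /** Runs the given number of clients; returns the requests the server saw, sorted. */
    static List<Integer> demo(int clients) {
        // Fair mode: threads blocked on the queue are served in FIFO order.
        final SynchronousQueue<Integer> shared = new SynchronousQueue<>(true);
        List<Thread> threads = new ArrayList<>();
        for (int id = 0; id < clients; id++) {
            final int request = id;
            Thread client = new Thread(() -> {
                try {
                    shared.put(request);      // blocks until the server accepts it
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads.add(client);
            client.start();
        }
        List<Integer> served = new ArrayList<>();
        try {
            for (int i = 0; i < clients; i++) {
                served.add(shared.take());    // one client transaction at a time
            }
            for (Thread client : threads) {
                client.join();
            }
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        Collections.sort(served);             // arrival order depends on scheduling
        return served;
    }

    public static void main(String[] args) {
        System.out.println(demo(4));
    }
}
```

The server never needs to know how many clients exist or to select between them, which is the attraction over ALTing described above. The order in which requests arrive is scheduling dependent, so the sketch sorts them before returning.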

®  In  fact,  the  client/server  relationship  is  with  respect  to  the  channel  structure.  A 
process  may  be  both  a  server  on  one  interface  and  a  client  on  another. 

®  Examples  for  such  servers  include  window  managers  for  multiple  animation  processes, 
data  loggers  for  recording  traces  from  multiple  components  from  some  machine,  etc. 
2.9 Buffered and Asynchronous Communications

We  have  seen  how  fixed  capacity  FIFO  buffers  can  be  added  as  active  processes 
to  CSP  channels.  For  the  occam  binding,  the  overheads  for  such  extra  processes 
are  negligible. 

With  the  JavaPP  libraries,  the  same  technique  may  be  used,  but  the  channel 
objects  can  be  directly  configured  to  support  buffered  communications  -  which 
saves  a  couple  of  context  switches.  The  user  may  supply  objects  supporting  any 
buffering  strategy  for  channel  configuration,  including  normal  blocking  buffers, 
overwrite-when-full buffers, infinite buffers and black-hole buffers (channels that
can  be  written  to  but  not  read  from  -  useful  for  masking  off  unwanted  outputs 
from  components  that,  otherwise,  we  wish  to  reuse  intact).  However,  the  user 
had  better  stay  aware  of  the  semantics  of  the  channels  thus  created! 

Asynchronous communication is commonly found in libraries supporting inter-processor
message-passing (such as PVM and MPI). However, the concurrency
model  usually  supported  is  one  for  which  there  is  only  one  thread  of  control  on 
each  processor.  Asynchronous  communication  lets  that  thread  of  control  launch 
an  external  communication  and  continue  with  its  computation.  At  some  point, 
that  computation  may  need  to  block  until  that  communication  has  completed. 

These  mechanisms  are  easy  to  obtain  from  the  concurrency  model  we  are 
teaching  (and  which  we  claim  to  be  general).  We  don’t  need  anything  new. 
Asynchronous  sends  are  what  happen  when  we  output  to  a  buffer  (or  buffered 
channel).  If  we  are  worried  about  being  blocked  when  the  buffer  is  full  or  if  we 
need  to  block  at  some  later  point  (should  the  communication  still  be  unfinished), 
we  can  simply  spawn  off  another  process'^  to  do  the  send: 

(out ! packet --> SKIP |PRI| SomeMoreComputation (...));

Continue  (...) 

The  Continue  process  only  starts  when  both  the  packet  has  been  sent 
and  SomeMoreComputation  has  finished.  SomeMoreComputation  and  sending  the 
packet proceed concurrently. We have used the priority version of the parallel
operator (|PRI|, which gives priority to its left operand) to ensure that the sending
process initiates the transfer before the SomeMoreComputation is scheduled.
Asynchronous  receives  are  implemented  in  the  same  way: 

(in ? packet --> SKIP |PRI| SomeMoreComputation (...));

Continue  (...) 
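The same shape can be sketched with ordinary Java threads. This is only an approximation under stated assumptions: a java.util.concurrent.SynchronousQueue stands in for the channel, the class and method names are hypothetical, and the priority of the |PRI| operator is not modelled, since plain Java threads give no such ordering guarantee.

```java
import java.util.concurrent.SynchronousQueue;

// Overlapping a blocking send with further computation by spawning a sender.
final class AsyncSendSketch {
    static int someMoreComputation() {
        int sum = 0;
        for (int i = 1; i <= 10; i++) {
            sum += i;
        }
        return sum;
    }

    /** Returns { packet seen by the receiver, result of the overlapped computation }. */
    static int[] demo() {
        final SynchronousQueue<Integer> out = new SynchronousQueue<>();
        final int[] received = new int[1];
        Thread sender = new Thread(() -> {
            try {
                out.put(42);                       // out ! packet
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread receiver = new Thread(() -> {
            try {
                received[0] = out.take();          // the process at the other end
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        sender.start();
        receiver.start();
        int computed = someMoreComputation();      // runs while the send is in progress
        try {
            sender.join();                         // Continue starts only when the send
            receiver.join();                       // and the computation have both finished
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return new int[] { received[0], computed };
    }

    public static void main(String[] args) {
        int[] result = demo();
        System.out.println(result[0] + " " + result[1]);
    }
}
```

The join on the sender plays the part of the sequential composition with Continue: nothing after it runs until the packet has been taken.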


2.10  Shared  Channels 

CSP  channels  are  strictly  point-to-point.  occam3[28]  introduced  the  notion  of 
(securely)  shared  channels  and  channel  structures.  These  are  further  extended 
in  the  KRoC  occam[29]  and  JavaPP  libraries  and  are  included  in  the  teaching 
model. 

^  The  Occam  overheads  for  doing  this  are  less  than  half  a  microsecond. 
For deadlock freedom, each server must guarantee to respond to a client call
within  some  bounded  time.  During  its  transaction  with  the  client,  it  must  follow 
the  protocols  for  communication  defined  for  the  channel  structure  and  it  may 
engage  in  separate  client  transactions  with  other  servers.  A  client  may  open  a 
transaction  at  any  time  but  may  not  interleave  its  communications  with  the 
server  with  any  other  synchronisation  (e.g.  with  another  server).  These  rules 
have  been  formalised  as  CSP  specifications[21].  Client-server  networks  may  have 
plenty of data-flow feedback but, so long as no cycle of client-server relations
exists, [21] gives formal proof that the system is deadlock, livelock and starvation
free. 

Shared  channel  structures  may  be  stretched  across  distributed  memory  (e.g. 
networked) multiprocessors[15]. Channels may carry all kinds of object - including
channels and processes themselves. A shared channel is an excellent means for
a client and server to find each other, pass over a private channel and communicate
independently of the shared one. Processes will drag pre-attached channels
with  them  as  they  are  moved  and  can  have  local  channels  dynamically  (and 
temporarily)  attached  when  they  arrive.  See  David  May’s  work  on  Icarus[30, 31] 
for  a  consistent,  simple  and  practical  realisation  of  this  model  for  distributed 
and  mobile  computing. 

3  Events  and  Shared  Memory 

Shared  memory  concurrency  is  often  described  as  being  ‘easier’  than  message 
passing.  But  great  care  must  be  taken  to  synchronise  concurrent  access  to  shared 
data,  else  we  will  be  plagued  with  race  hazards  and  our  systems  will  be  useless. 
CSP  primitives  provide  a  sharp  set  of  tools  for  exercising  this  control. 


3.1  Symmetric  Multi-Processing  (SMP) 

The  private  memory/algorithm  principles  of  the  underlying  model  -  and  the 
security  guarantees  that  go  with  them  -  are  a  powerful  way  of  programming 
shared memory multiprocessors. Processes can be automatically and dynamically
scheduled between available processors (one object code fits all). So long
as there is an excess of (runnable) processes over processors and the scheduling
overheads are sufficiently low, high multiprocessor efficiency can be achieved -
with guaranteed no race hazards. With the design methods we have been describing,
it’s very easy to generate lots of processes with most of them runnable
most  of  the  time. 

3.2  Token  Passing  and  Dynamic  CREW 

Taking  advantage  of  shared  memory  to  communicate  between  processes  is  an 
extension  to  this  model  and  must  be  synchronised.  The  shared  data  does  not 
belong  to  any  of  the  sharing  processes,  but  must  be  globally  visible  to  them  - 
either  on  the  stack  (for  Occam)  or  heap  (for  Java). 
The JavaPP channels in previous examples were only used to send data values
between  processes  -  but  they  can  also  be  used  to  send  objects.  This  steps  outside 
the  automatic  guarantees  against  race  hazard  since,  unconstrained,  it  allows 
parallel  access  to  the  same  data.  One  common  and  useful  constraint  is  only  to 
send  immutable  objects.  Another  design  pattern  treats  the  sent  object  as  a  token 
conferring permission to use it - the sending process losing the token as a
side-effect of the communication. The trick is to ensure that only one copy of the
token  ever  exists  for  each  sharable  object. 

Dynamic CREW (Concurrent Read Exclusive Write) operations are also possible
with shared memory. Shared channels give us an efficient, elegant and easily
provable way to construct an active guardian process with which application
processes synchronise to effect CREW access to the shared data. Guarantees against
starvation  of  writers  by  readers  -  and  vice-versa  -  are  made.  Details  will  appear 
in  a  later  report  (available  from  [32]). 


3.3  Structured  Barrier  Synchronisation  and  SPMD 

Point-to-point channels are just a specialised form of the general CSP multi-process
synchronising event. The CSP parallel operator binds processes together
with  events.  When  one  process  synchronises  on  an  event,  all  processes  registered 
for  that  event  must  synchronise  on  it  before  that  first  process  may  continue. 
Events  give  us  structured  multiway  barrier  synchronisation[29]. 


[Figure: execution traces of processes P, M and D synchronising on barriers b0, b1 and b2]

Fig. 14. Multiple barriers to three processes


We can have many event barriers in a system, with different (and not necessarily
disjoint) subsets of processes registered for each barrier. Figure 14 shows
the execution traces for three processes (P, M and D) with time flowing horizontally.
They do not all progress at the same - or even constant - speed. From
time to time, the faster ones will have to wait for their slower partners to reach
an agreed barrier before all of them can proceed. We can wrap up the system in
typical SPMD form as:

  || <i = 0 FOR 3>
    S (i, ..., b0, b1, b2)
where b0, b1 and b2 are events. The replicated parallel operator runs 3 instances
of  S  in  parallel  (with  i  taking  the  values  0,  1  and  2  respectively  in  the  different 
instances).  The  S  process  simply  switches  into  the  required  form: 

S (i, ..., b0, b1, b2)
  = CASE i
      0 : P (..., b0, b1)
      1 : M (..., b0, b1, b2)
      2 : D (..., b1, b2)

and  where  P,  M  and  D  are  registered  only  for  the  events  in  their  parameters.  The 
code  for  P  has  the  form: 

P (..., b0, b1)
  = someWork (...); b0 --> SKIP;
    moreWork (...); b0 --> SKIP;
    lastBitOfWork (...); b1 --> SKIP;
    P (..., b0, b1)
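For readers more at home with standard Java, one round of this pattern can be approximated with java.util.concurrent.CyclicBarrier, swapped in here for the CSP events. Only the sequence for P (b0, b0, b1) comes from the text. The sequences used for M (b0, b2, b0, b1) and D (b2, b1) are assumptions chosen to be compatible with it, and the class name BarrierSketch is hypothetical.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

// Three processes, three barriers, each barrier shared by a different subset.
final class BarrierSketch {
    static void sync(CyclicBarrier barrier) {
        try {
            barrier.await();
        } catch (InterruptedException | BrokenBarrierException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns how many processes saw all three pieces of work finished after b1. */
    static int demo() {
        final CyclicBarrier b0 = new CyclicBarrier(2);   // registered: P and M
        final CyclicBarrier b1 = new CyclicBarrier(3);   // registered: P, M and D
        final CyclicBarrier b2 = new CyclicBarrier(2);   // registered: M and D
        final AtomicInteger workDone = new AtomicInteger();
        final AtomicInteger sawAllWork = new AtomicInteger();
        final Runnable afterB1 = () -> {
            if (workDone.get() == 3) {
                sawAllWork.incrementAndGet();
            }
        };
        Thread p = new Thread(() -> {
            sync(b0); sync(b0);                          // someWork; b0; moreWork; b0
            workDone.incrementAndGet();                  // lastBitOfWork
            sync(b1);
            afterB1.run();
        });
        Thread m = new Thread(() -> {
            sync(b0); sync(b2); sync(b0);                // assumed schedule for M
            workDone.incrementAndGet();
            sync(b1);
            afterB1.run();
        });
        Thread d = new Thread(() -> {
            sync(b2);                                    // assumed schedule for D
            workDone.incrementAndGet();
            sync(b1);
            afterB1.run();
        });
        Thread[] all = { p, m, d };
        for (Thread t : all) {
            t.start();
        }
        try {
            for (Thread t : all) {
                t.join();
            }
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return sawAllWork.get();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Each process only ever touches the barriers it is registered for, mirroring the parameter lists of P, M and D. Because b1 involves all three, every process is guaranteed to see the others' work once it has passed b1.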

3.4  Non-Blocking  Barrier  Synchronisation 

In  the  same  way  that  asynchronous  communications  can  be  expressed  (section 
2.9),  we  can  also  achieve  the  somewhat  contradictory  sounding,  but  potentially 
useful,  non-blocking  barrier  synchronisation. 

In terms of serial programming, this is a two-phase commitment to the barrier.
The first phase declares that we have done everything we need to do this
side  of  the  barrier,  but  does  not  block  us.  We  can  then  continue  for  a  while,  doing 
things  that  do  not  disturb  what  we  have  set  up  for  our  partners  in  the  barrier 
and  do  not  need  whatever  it  is  that  they  have  to  set.  When  we  need  their  work, 
we  enter  the  second  phase  of  our  synchronisation  on  the  barrier.  This  blocks  us 
only  if  there  is  one,  or  more,  of  our  partners  who  has  not  reached  the  first  phase 
of  their  synchronisation.  With  luck,  this  window  on  the  barrier  will  enable  most 
processes  most  of  the  time  to  pass  through  without  blocking: 

  doOurWorkNeededByOthers (...);
  barrier.firstPhase ();
  privateWork (...);
  barrier.secondPhase ();
  useSharedResourcesProtectedByTheBarrier (...);

With our lightweight CSP processes, we do not need these special phases to
get the same effect:

  doOurWorkNeededByOthers (...);
  (barrier --> SKIP |PRI| privateWork (...));
  useSharedResourcesProtectedByTheBarrier (...);

The explanation as to why this works is just the same as for the asynchronous
sends  and  receives. 
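A hedged Java rendering of the second form is given below. It assumes java.util.concurrent.CyclicBarrier as the barrier and a freshly spawned thread as the lightweight process that waits on it. Java threads are far heavier than occam processes, so this shows the structure only, not the cost. The class name and the shared array are illustrative assumptions.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

// Each worker commits to the barrier in a helper thread while doing private work.
final class NonBlockingBarrierSketch {
    /** Returns how many workers saw every partner's contribution after the barrier. */
    static int demo(int workers) {
        final CyclicBarrier barrier = new CyclicBarrier(workers);
        final int[] shared = new int[workers];       // one slot written by each worker
        final int[] seen = new int[workers];         // sum observed by each worker
        Thread[] threads = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            final int id = w;
            threads[w] = new Thread(() -> {
                shared[id] = id + 1;                 // doOurWorkNeededByOthers
                Thread sync = new Thread(() -> {
                    try {
                        barrier.await();             // barrier --> SKIP
                    } catch (InterruptedException | BrokenBarrierException e) {
                        throw new IllegalStateException(e);
                    }
                });
                sync.start();
                long privateResult = (long) id * id; // privateWork, overlapping the wait
                try {
                    sync.join();                     // blocks only if partners are late
                } catch (InterruptedException e) {
                    throw new IllegalStateException(e);
                }
                int sum = 0;
                for (int value : shared) {           // useSharedResourcesProtectedByTheBarrier
                    sum += value;
                }
                seen[id] = privateResult >= 0 ? sum : -1;
            });
            threads[w].start();
        }
        int complete = 0;
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        for (int sum : seen) {
            if (sum == workers * (workers + 1) / 2) {
                complete++;
            }
        }
        return complete;
    }

    public static void main(String[] args) {
        System.out.println(demo(3));
    }
}
```

The join on the helper thread corresponds to the end of the parallel construct: the worker proceeds to the shared data only after both its private work and the barrier have completed.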
3.5 Bucket Synchronisation

Although  CSP  allows  choice  over  general  events,  the  occam  and  Java  bindings 
do  not.  The  reasons  are  practical  -  a  concern  for  run-time  overheads^®.  So, 
synchronising  on  an  event  commits  a  process  to  wait  until  everyone  registered  for 
the  event  has  synchronised.  These  multi-way  events,  therefore,  do  not  introduce 
non-determinism  into  a  system  and  provide  a  stable  platform  for  much  scientific 
and  engineering  modelling. 

Buckets[15] provide a non-deterministic version of events that is useful
when the system being modelled is irregular and dynamic (e.g. motor vehicle
traffic[33]).  Buckets  have  just  two  operations:  jump  and  kick.  There  is  no  limit 
to  the  number  of  processes  that  can  jump  into  a  bucket  -  where  they  all  block. 
Usually,  there  will  only  be  one  process  with  responsibility  for  kicking  over  the 
bucket.  This  can  be  done  at  any  time  of  its  own  (internal)  choosing  -  hence  the 
non-determinism.  The  result  of  kicking  over  a  bucket  is  the  unblocking  of  all  the 
processes  that  had  jumped  into  it^^. 
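The behaviour of jump and kick can be imitated with a small Java monitor. This is a sketch under stated assumptions, not the library's bucket: it is built on the very wait/notifyAll machinery the CSP libraries hide, its kick is not constant time, and the holding method and the demo harness are additions for illustration only.

```java
// A bucket: any number of threads may jump in and block; one kick releases them all.
final class Bucket {
    private int generation = 0;   // incremented on every kick
    private int waiting = 0;      // threads currently in the bucket

    /** Blocks the caller until the next kick. */
    synchronized void jump() throws InterruptedException {
        int mine = generation;
        waiting++;
        while (mine == generation) {   // guards against spurious wake-ups
            wait();
        }
    }

    /** Releases everything in the bucket; returns how many were released. */
    synchronized int kick() {
        int released = waiting;
        waiting = 0;
        generation++;
        notifyAll();
        return released;
    }

    /** Hypothetical helper: how many threads are in the bucket now. */
    synchronized int holding() {
        return waiting;
    }

    /** Starts the given number of jumpers, waits until all are in, then kicks. */
    static int demo(int jumpers) {
        final Bucket bucket = new Bucket();
        Thread[] threads = new Thread[jumpers];
        for (int i = 0; i < jumpers; i++) {
            threads[i] = new Thread(() -> {
                try {
                    bucket.jump();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[i].start();
        }
        try {
            while (bucket.holding() < jumpers) {
                Thread.sleep(1);               // the kicker chooses its own moment
            }
            int released = bucket.kick();
            for (Thread t : threads) {
                t.join();
            }
            return released;
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo(3));
    }
}
```

Unlike a barrier, the bucket has no fixed membership: the kicker decides when to release, and whoever happens to be inside at that moment goes free.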

4  Conclusions 

A simple model for parallel computing has been presented that is easy to learn,
teach and use. Based upon the mathematically sound framework of Hoare’s CSP,
it has a compositional semantics that corresponds well with our intuition about
how the world is constructed. The basic model encompasses object-oriented design
with active processes (i.e. objects whose methods are exclusively under their
own thread of control) communicating via passive, but synchronising, wires. Systems
can be composed through natural layers of communicating components so
that an understanding of each layer does not depend on an understanding of the
inner ones. In this way, systems with arbitrarily complex behaviour can be safely
constructed - free from race hazard, deadlock, livelock and process starvation.

A  small  extension  to  the  model  addresses  fundamental  issues  and  paradigms 
for  shared  memory  concurrency  (such  as  token  passing,  CREW  dynamics  and 
bulk synchronisation). We can explore with equal fluency serial, message-passing
and shared-memory logic and strike whatever balance between them is appropriate
for the problem under study. Applications include hardware design (e.g.
FPGAs and ASICs), real-time control systems, animation, GUIs, regular and
irregular  modelling,  distributed  and  mobile  computing. 

Occam  and  Java  bindings  for  the  model  are  available  to  support  practical 
work  on  commodity  PCs  and  workstations.  Currently,  the  occam  bindings  are 

Synchronising on an event in occam has a unit time overhead, regardless of the number
of processes registered. This includes being the last process to synchronise, when
all blocked processes are released. These overheads are well below a microsecond for
modern microprocessors.

“  As  for  events,  the  jump  and  kick  operations  have  constant  time  overhead,  regardless 
of  the  number  of  processes  involved.  The  bucket  overheads  are  slightly  lower  than 
those  for  events. 




VECPAR’98  -  3rd  International  Meeting  on  Vector  and  Parallel  Processing 


the  fastest  (context-switch  times  under  300  nano-seconds),  lightest  (in  terms 
of  memory  demands),  most  secure  (in  terms  of  guaranteed  thread  safety)  and 
quickest  to  learn.  But  Java  has  the  libraries  (e.g.  for  GUIs  and  graphics)  and 
will  get  faster.  Java  thread  safety  depends  on  following  the  CSP  design  patterns, 
but  these  are  easy  to  acquire^^. 

The  JavaPP  JCSP  library[ll]  also  includes  an  extension  to  the  Java  AWT 
package  that  drops  channel  interfaces  on  all  GUI  components'^.  Each  item  (e.g. 
a Button) is a process with a configure and action channel interface. These are
connected to separate internal handler processes. To change the text or colour
of a Button, an application process outputs to its configure channel. If someone
presses the Button, it outputs down its action channel to an application
process (which can accept or refuse the communication as it chooses). Example
demonstrations of the use of this package may be found at [11]. Whether
GUI  programming  through  the  process-channel  design  pattern  is  simpler  than 
the  listener-callback  pattern  offered  by  the  underlying  AWT,  we  leave  for  the 
interested  reader  to  experiment  and  decide. 

All  the  primitives  described  in  this  paper  are  available  for  KRoC  occam  and 
Java.  Multiprocessor  versions  of  the  KRoC  kernel  targeting  NOWs  and  SMPs 
will  be  available  later  this  year.  SMP  versions  of  the  JCSP[11]  and  CJT[12] 
libraries  are  automatic  if  your  JVM  supports  SMP  threads.  Hooks  are  provided 
in  the  channel  libraries  to  allow  user-defined  network  drivers  to  be  installed. 
Research  is  continuing  on  portable/faster  kernels  and  language/tool  design  for 
enforcing  higher  level  aspects  of  CSP  design  patterns  (e.g.  for  shared  memory 
safety  and  deadlock  freedom)  that  currently  rely  on  self-discipline. 

Finally, we stress that this is undergraduate material. The concepts are mature
and fundamental - not advanced - and the earlier they are introduced the
better. For developing fluency in concurrent design and implementation, no special
hardware is needed. Students can graduate to real parallel systems once they
have mastered this fluency. The CSP model is neutral with respect to parallel
architecture so that coping with a change in language or paradigm is straightforward.
However, even for uni-processor applications, the ability to do safe and
lightweight  multithreading  is  becoming  crucial  both  to  improve  response  times 
and  simplify  their  design. 

The  experience  at  Kent  is  that  students  absorb  these  ideas  very  quickly  and 
become  very  creative^"*.  Now  that  they  can  apply  them  in  the  context  of  Java, 
they  are  smiling  indeed. 

Java active objects (i.e. processes) do not invoke each other’s methods and communicate
only through shared passive objects with carefully designed synchronisation
properties  (e.g.  channels  and  events).  Shared  use  of  user-defined  passive  objects  will 
be  automatically  thread-safe  so  long  as  the  shared  memory  usage  patterns  are  kept. 
We  do  not  need  to  get  involved  with  the  monitor  model  within  Java. 

We  believe  that  the  new  Swing  GUI  libraries  from  Sun  (that  will  replace  the  AWT) 
can  also  be  extended  through  a  channel  interface  for  secure  use  in  parallel  designs  - 
despite  the  warnings  concerning  the  use  of  Swing  and  multithreading[34]. 

The JCSP libraries used in Appendix B were produced by Paul Austin, an undergraduate
student at Kent.
Appendix A: occam Executables

Space  only  permits  a  sample  of  the  examples  to  be  shown  here.  This  first  group  are 
from  the  ‘Legoland’  catalogue  (Section  2.3): 

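PROC Id (CHAN OF INT in, out)
  -- copies every input straight to its output
  WHILE TRUE
    INT x:
    SEQ
      in ? x
      out ! x
:

PROC Plus (CHAN OF INT in0, in1, out)
  -- reads one value from each input in parallel, then outputs their sum
  WHILE TRUE
    INT x0, x1:
    SEQ
      PAR
        in0 ? x0
        in1 ? x1
      out ! x0 PLUS x1
:

PROC Prefix (VAL INT n, CHAN OF INT in, out)
  -- outputs n first, then behaves like Id
  SEQ
    out ! n
    Id (in, out)
:

PROC Succ (CHAN OF INT in, out)
  -- outputs each input incremented by one
  WHILE TRUE
    INT x:
    SEQ
      in ? x
      out ! x PLUS 1
: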


Next come four of the ‘Plug and Play’ examples from Sections 2.4 and 2.6:

PROC Numbers (CHAN OF INT out)
  CHAN OF INT a, b, c:
  PAR
    Prefix (0, c, a)
    Delta (a, out, b)
    Succ (b, c)
:

PROC Integrate (CHAN OF INT in, out)
  CHAN OF INT a, b, c:
  PAR
    Plus (in, c, a)
    Delta (a, out, b)
    Prefix (0, b, c)
:

PROC Pairs (CHAN OF INT in, out)
  CHAN OF INT a, b, c:
  PAR
    Delta (in, a, b)
    Tail (b, c)
    Plus (a, c, out)
:

PROC Squares (CHAN OF INT out)
  CHAN OF INT a, b:
  PAR
    Numbers (a)
    Integrate (a, b)
    Pairs (b, out)
:
Here is one of the controllers from Section 2.7:

PROC Replace (CHAN OF INT in, inject, out)
  WHILE TRUE
    PRI ALT
      INT x:
      inject ? x
        -- forward the injected value and discard the next input, in parallel
        PAR
          INT discard:
          in ? discard
          out ! x
      INT x:
      in ? x
        out ! x
:


Asynchronous receive from Section 2.9:

PRI PAR
  in ? packet
  SomeMoreComputation (...)
Continue (...)

Barrier synchronisation from Section 3.3:

PROC P (..., EVENT b0, b1)
  ... local state declarations
  SEQ
    ... initialise local state
    WHILE TRUE
      SEQ
        someWork (...)
        synchronise.event (b0)
        moreWork (...)
        synchronise.event (b0)
        lastBitOfWork (...)
        synchronise.event (b1)
:


Finally,  non-blocking  barrier  synchronisation  from  Section  3.4: 


doOurWorkNeededByOthers (...)
PRI PAR
  synchronise.event (barrier)
  privateWork (...)
useSharedResourcesProtectedByTheBarrier (...)


Appendix B: Java Executables

These examples use the JCSP library for processes and channels [11]. A process is an
instance of a class that implements the CSProcess interface. This is similar to, but
different from, the standard Runnable interface:

package jcsp.lang;

public interface CSProcess {

  public abstract void run ();

}

For  example,  from  the  ‘Legoland’  catalogue  (Section  2.3): 

import jcsp.lang.*;

class Succ implements CSProcess {

  private ChannelInputInt in;
  private ChannelOutputInt out;

  public Succ (ChannelInputInt in, ChannelOutputInt out) {
    this.in = in;
    this.out = out;
  }

  public void run () {
    while (true) {
      int x = in.read ();
      out.write (x + 1);
    }
  }

}

class Prefix implements CSProcess {

  private int n;
  private ChannelInputInt in;
  private ChannelOutputInt out;

  public Prefix (int n, ChannelInputInt in, ChannelOutputInt out) {
    this.n = n;
    this.in = in;
    this.out = out;
  }

  public void run () {
    out.write (n);
    new Id (in, out).run ();
  }

}

JCSP provides a Parallel class that combines an array of CSProcesses into a CSProcess.
Its execution is the parallel composition of that array. For example, here are two of
the 'Plug and Play' examples from Sections 2.4 and 2.6:

class Numbers implements CSProcess {

  private ChannelOutputInt out;

  public Numbers (ChannelOutputInt out) {
    this.out = out;
  }

  public void run () {
    One2OneChannelInt a = new One2OneChannelInt ();
    One2OneChannelInt b = new One2OneChannelInt ();
    One2OneChannelInt c = new One2OneChannelInt ();
    new Parallel (
      new CSProcess[] {
        new Delta (a, out, b),
        new Succ (b, c),
        new Prefix (0, c, a),
      }
    ).run ();
  }

}

class Squares implements CSProcess {

  private ChannelOutputInt out;

  public Squares (ChannelOutputInt out) {
    this.out = out;
  }

  public void run () {
    One2OneChannelInt a = new One2OneChannelInt ();
    One2OneChannelInt b = new One2OneChannelInt ();
    new Parallel (
      new CSProcess[] {
        new Numbers (a),
        new Integrate (a, b),
        new Pairs (b, out),
      }
    ).run ();
  }

}

Here is one of the controllers from Section 2.7. The processes ReadInt and WriteInt
just read and write a single integer (from and to a public value field):

class Replace implements CSProcess {

  private AltingChannelInputInt in;
  private AltingChannelInputInt inject;
  private ChannelOutputInt out;

  public Replace (AltingChannelInputInt in,
                  AltingChannelInputInt inject,
                  ChannelOutputInt out) {
    this.in = in;
    this.inject = inject;
    this.out = out;
  }

  public void run () {
    Alternative alt = new Alternative ();
    AltingChannelInputInt[] altChans = {inject, in};
    // Declared with their concrete types so that the public value field is accessible.
    WriteInt writeInt = new WriteInt (out);
    ReadInt readInt = new ReadInt (in);
    CSProcess parIO = new Parallel (new CSProcess[] {readInt, writeInt});
    while (true) {
      switch (alt.select (altChans)) {
        case 0:
          // An injected value replaces the next value arriving on in.
          writeInt.value = inject.read ();
          parIO.run ();
          break;
        case 1:
          out.write (in.read ());
          break;
      }
    }
  }

}


JCSP also has channels for sending and receiving arbitrary Objects. Here is an
asynchronous receive (from Section 2.9) of an expected Packet:

// set up processes once (before we start looping ...)
// readObj keeps its concrete type so that its public object field is accessible
ReadObj readObj = new ReadObj (in);
CSProcess someMore = new SomeMoreComputation (...);
CSProcess async = new PriParallel (new CSProcess[] {readObj, someMore});

while (looping) {
  async.run ();
  Packet packet = (Packet) readObj.object;
  Continue (...);
}




VECPAR '98 - 3rd International Meeting on Vector and Parallel Processing


An  ISA  comparison  between  Superscalar  and 
Vector  Processors 


Francisca Quintana¹, Roger Espasa², and Mateo Valero²

¹ University of Las Palmas de Gran Canaria, Edificio de Informatica y Matematicas,
Campus de Tafira, 35017 Las Palmas de Gran Canaria, Canary Islands, Spain
fquintan@dis.ulpgc.es

² U. Politecnica Catalunya-Barcelona, Computer Architecture Department, Campus Nord
{roger,mateo}@ac.upc.es


Abstract. This paper presents a comparison between superscalar and vector
processors. First, we start with a detailed ISA analysis of the vector machine,
including data related to masked execution, vector length and vector first
facilities. Then we present a comparison of the two models at the instruction set
architecture (ISA) level that shows that the vector model has several advantages:
it executes fewer instructions, fewer overall operations, and generally fewer
memory accesses. We then analyse both models in terms of speculative execution,
each one in its context. Results show that superscalar processors make extensive
use of speculation and that there is a large amount of misspeculated instructions.
In the vector model, speculation is achieved using vector masks and, in general,
fewer operations are misspeculated.


1  Introduction 

Traditionally, there have been different approaches aimed at improving
microprocessor performance. One of them has been the exploitation of data level
parallelism (DLP). The DLP paradigm uses vectorization techniques to discover data
level parallelism in a sequentially specified program and expresses this
parallelism using vector instructions [1][2][3]. A single vector instruction
specifies a series of operations to be performed on a stream of data. Each
operation performed on each individual element is independent of all others and,
therefore, a vector instruction is easily pipelineable and highly parallel [4].
Another approach aimed at reaching high performance in a program's execution is
the exploitation of instruction level parallelism (ILP). Current state-of-the-art
microprocessors all include 4-wide fetch engines coupled with sophisticated branch
predictors, large reorder buffers to dynamically schedule instructions and
non-blocking caches to allow multiple outstanding misses. All these techniques
focus on a single goal: executing several instructions that are known to be
independent, in parallel. The larger the number of instructions that can be
launched on each cycle, the better the performance achieved.




FEUP  -  Faculdade  de  Engenharia  da  Universidade  do  Porto 


There are two very important advantages in using vector instructions to express
data-level parallelism. First, the total number of instructions that have to be
executed to complete a program is reduced because each vector instruction has more
semantic content than the corresponding scalar instructions. Second, the fact that
the individual operations in a single vector instruction are independent allows a
more efficient execution: once a vector instruction is issued to a functional
unit, it will keep that unit busy with useful work for many cycles. During those
cycles, the processor can look for other vector instructions to be launched to the
same or other functional units. It is very likely that, by the time a vector
instruction completes all its work, there is already another vector instruction
ready to occupy the functional unit. Meanwhile, in a scalar processor, when an
instruction is launched to a functional unit, another instruction is required at
the very next cycle to keep the functional unit busy. Unfortunately, many hazards
can get in the way of this requirement: true data dependencies, cache misses,
branch misspeculation, etc.

The combination of these two effects has many related advantages. First, the
pressure on the fetch unit is greatly reduced. By specifying many operations with a
single instruction, the total number of different instructions that have to be
fetched is reduced. Many branches disappear, embedded in the semantics of vector
instructions. A second advantage is the simplicity of the control unit. With
relatively little control effort, a vector architecture can control the execution
of many different functional units, since most of them work in parallel in a fully
synchronous way. A third advantage is related to the way the memory system is
accessed: a single vector instruction can exactly specify a long sequence of memory
addresses. Consequently, the hardware has considerable advance knowledge regarding
memory references, can schedule these accesses in an efficient way [8], and needs
to access no more data than is actually needed. In addition, a vector memory
operation is able to amortize start-up latencies over a potentially long stream of
vector elements.

In this paper we make a comparison between vector and superscalar processors by
analysing the behaviour of a Mips R10000 [9] superscalar processor and a Convex
C4 [10] vector processor. This study is carried out from different points of view.
First of all we introduce an initial analysis of the Convex C4 vector processor.
This includes an overview of several intrinsic characteristics of vector
processing: we will analyze the effect of execution under mask and execution using
the vector first facility. Then we will compare the superscalar and vector
approaches from the ISA point of view. We will present data about the number of
instructions and operations executed in both processors. Finally, we will present a
comparison of speculative execution in the two approaches.


2  Convex  C4  Analysis 


We will start by analyzing the vector length and vector mask facilities of vector
processors. We will also present the vector first facility, which is specific to
the Convex C4 machine. Then we will compare the number of instructions, operations
and memory traffic of vector processors and superscalars.

This study will be carried out using the six most vectorizable programs from
Specfp92. We have measured the vectorization percentage using the Dixie tool [11].
We have generated the execution traces of the Specfp92 programs when running on a
Convex C4 machine, and then we have used the Jinks simulator to measure the number
of vector and scalar operations carried out by the programs. The vectorization
percentage has been calculated as the ratio between vector operations and the sum
of vector and scalar operations.


2.1  Operation  Distribution 

Table 1 presents the basic operation distribution for the six most vectorizable
programs of Specfp92. The first column shows the total number of basic blocks (in
millions) executed for each program. The next two columns present the total number
of instructions broken down into scalar and vector instructions. We distinguish
between instructions and operations: a scalar instruction performs only one
operation, while a vector instruction performs several operations, depending on the
value of the vector length (VL) register. The fourth column gives the number of
vector operations. The fifth column is the percentage of vectorization for each
program, defined as the ratio between the number of vector operations and the total
number of operations performed. Finally, the sixth column presents the average
vector length used in vector instructions. An interesting point from this table is
that the average vector length observed in the programs is not strongly related to
the percentage of vectorization.

Table 1. Operation distribution (counts in millions)

           # basic    # instructions       # vector      %       Avg.
Program    blocks     Scalar    Vector     operations    Vect    VL
Swm256       2.57      27.46     74.82       8127.98     99.7     93
Hydro2d      4.74      38.85     35.43       3684.89     99.0    101
Nasa7       16.79     139.80     55.98       3885.02     96.5     62
Su2cor      22.53     143.95     24.08       3066.07     95.5    125
Tomcatv     19.95     126.66      6.37        644.41     83.6     99
Wave5       48.99     579.77     35.88       1615.04     73.6     43
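
As a quick check of the definition of the vectorization percentage, the following Python sketch recomputes the "% Vect" column of Table 1 from its scalar-instruction and vector-operation columns. It relies on the statement above that a scalar instruction performs exactly one operation; the function and variable names are illustrative and do not come from the Dixie or Jinks tools.

```python
# Values copied from Table 1 (all counts in millions).
# Each entry: (scalar instructions, vector operations, reported % vectorization)
TABLE_1 = {
    "Swm256":  (27.46, 8127.98, 99.7),
    "Hydro2d": (38.85, 3684.89, 99.0),
    "Nasa7":   (139.80, 3885.02, 96.5),
    "Su2cor":  (143.95, 3066.07, 95.5),
    "Tomcatv": (126.66, 644.41, 83.6),
    "Wave5":   (579.77, 1615.04, 73.6),
}


def vectorization_percentage(scalar_ops, vector_ops):
    """Vector operations as a percentage of all operations performed."""
    return 100.0 * vector_ops / (scalar_ops + vector_ops)


for program, (scalar, vector, reported) in TABLE_1.items():
    computed = vectorization_percentage(scalar, vector)
    print(f"{program:8s} computed {computed:5.1f}%  reported {reported:5.1f}%")
```

The recomputed values agree with the reported column to within rounding, which supports reading the scalar-instruction column as the scalar operation count.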

2.2  Vector  Length  Distributions 

Vector execution is based on executing a certain operation specified in one
instruction over a large amount of independent data. The amount of data processed
by each instruction is specified dynamically by the value of the Vector Length (VL)
register. The latency of the operation being carried out is then amortized across
all VL elements. Therefore, the larger the VL, the better the performance. Fig. 1
presents the VL distribution for the six most vectorizable Specfp92 benchmarks. As
we can see, the vector length distributions follow several patterns. Swm256,
Tomcatv and Su2cor have the majority of their vector lengths clustered around 128.
Hydro2d has a single dominant vector length, which is the number of grid points
used in the z-direction of the problem. Nasa7 and Wave5 have a distribution that
follows a staircase, with several dominant vector lengths. All these data suggest
that, even among vectorizable programs, the utilisation of the vector registers
varies a lot.

Fig. 1. VL distribution for Specfp92 programs
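
The amortization argument can be made concrete with a deliberately simple model, sketched below in Python. It assumes that a vector instruction pays a fixed start-up latency and then delivers one element per cycle; both the model and the start-up value of 20 cycles are hypothetical and are not Convex C4 figures.

```python
def cycles_per_element(vector_length, startup_cycles):
    """Average cycles per element when one result is produced per cycle
    after a fixed start-up latency (a simplified, hypothetical model)."""
    if vector_length < 1:
        raise ValueError("vector_length must be at least 1")
    return (startup_cycles + vector_length) / vector_length


STARTUP = 20  # hypothetical start-up latency in cycles, not a Convex C4 figure

for vl in (1, 8, 43, 93, 128):
    print(f"VL={vl:3d}: {cycles_per_element(vl, STARTUP):6.2f} cycles/element")
```

Under these assumptions the cost per element falls towards one cycle as VL grows, which is why short dominant vector lengths leave more of the start-up cost unamortized.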


2.3  Vector  First  Capability 

A new capability in the Convex C4 processor is the Vector First (VF) facility,
which allows specifying the first element in the vector register on which the
instruction will be executed. That is, an instruction executes VL operations
starting at element VF. This facility avoids having to reload data in the case of
recurrences such as those presented in Fig. 2(a). In these cases, instead of
executing two load instructions for matrix B (for positions I and I+1, as presented
in Fig. 2(b)), only one load instruction is executed. Fig. 2(b) shows the assembly
code without vector first. Every add instruction involves two vector load
instructions, which is redundant. In Fig. 2(c), using vector first, the same data
can be reused in the loop body just by using the appropriate vector first value,
so just one vector load is needed for each add instruction. [Note that the notation
‘''vO' means execution under vector first.]

Fig. 2. Typical vector loop in the Hydro2d benchmark. (a) Source code for a vector
loop with a recurrence of distance 1. (b) Assembly code without the vector first
facility, with each add involving two load instructions. (c) Assembly code using
vector first, so that every data item is loaded just once
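
The saving from vector first can be illustrated with a small Python model of a loop of the form A(I) = B(I) + B(I+1). The loop itself is a hypothetical stand-in for the one in Fig. 2(a), and letting each operand carry its own vector-first offset is a modelling convenience: the text does not say how the C4 encodes which operand uses VF.

```python
MAX_VL = 128  # elements per vector register (the text cites lengths of 1 to 128)


def vector_load(memory, start, count):
    """Model a vector load: copy `count` consecutive elements into a register."""
    assert count <= MAX_VL
    return list(memory[start:start + count])


def vector_add(reg_a, reg_b, vl, vf_a=0, vf_b=0):
    """Model an add that performs `vl` operations, reading each source
    register starting at its own vector-first offset."""
    return [reg_a[vf_a + k] + reg_b[vf_b + k] for k in range(vl)]


def sum_neighbours_two_loads(b, vl):
    """Without vector first: B(I) and B(I+1) are loaded separately."""
    v0 = vector_load(b, 0, vl)
    v1 = vector_load(b, 1, vl)
    return vector_add(v0, v1, vl), 2  # result, number of vector loads


def sum_neighbours_vector_first(b, vl):
    """With vector first: one load of vl + 1 elements serves both operands."""
    v0 = vector_load(b, 0, vl + 1)
    return vector_add(v0, v0, vl, vf_a=0, vf_b=1), 1


b = [float(i * i) for i in range(9)]
plain, loads_plain = sum_neighbours_two_loads(b, 8)
reused, loads_vf = sum_neighbours_vector_first(b, 8)
print(plain == reused, loads_plain, loads_vf)
```

Both variants produce the same sums, but the second touches memory with a single vector load, which is the redundancy the facility is meant to remove.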

Table 2 presents the distribution of the vector first values for the same Specfp92
benchmarks as Fig. 1. This table shows the total number of operations carried out
under vector first and the respective percentages of operations that have been
executed with vector first equal to 1, 2 or other values. The compiler is not able
to use vector first in either Nasa7 or Su2cor. Moreover, the programs that do use
it only present low-order recurrences (with distance 1 or 2).

Table 2. Vector First distribution for Specfp92 programs

           # Ops under VF    VF value (in percentages)
Program    (x 10^6)            1       2     Other
Swm256        2.841           76      24       0
Hydro2d      11.060          100       0       0
Nasa7           -              -       -       -
Su2cor          -              -       -       -
Tomcatv       1.124           50      50       0
Wave5         1.449           97       3       0

2.4 Vector Mask Execution

The Convex C4 vector processor allows the execution of instructions under a
calculated mask stored in the Vector Mask (VM) register. All VL operations are
carried out, but only those that have the correct value stored in the i-th position
of the mask are finally stored in the destination register of the instruction. We
have analysed the masks used during the execution of the benchmarks in order to
test the effectiveness of masked execution. Table 3 shows the total number of
instructions executed under mask and their percentage with respect to the total
number of instructions. These data show a relatively small use of execution under
mask in the C4 vector processor. However, since each vector instruction implies the
execution of VL operations, Table 3 also shows the total number of operations
executed under mask and their percentage with respect to the total number of
operations. From this table we can see that the most intensive use of masked
execution is made by the Hydro2d benchmark, with more than 15% of its operations
executed under mask. Su2cor and Wave5 execute 3.95% and 3.64% of their operations
under mask, respectively. The remaining programs execute either very few operations
under mask (Swm256 and Nasa7) or none at all (Tomcatv).

The execution of operations under mask can be considered as speculative execution,
as all VL operations are carried out but only those that correspond to the right
value in the mask are used. We can think of the extra operations as misspeculated
execution. The analysis of the masks, as we will show, has allowed us to measure
the amount of speculative work carried out by the vector processor.
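
The semantics described above can be sketched in a few lines of Python. This is only a behavioural model under the description given in the text (all VL operations run, the mask selects which results are committed); the function name and the sample data are illustrative, and nothing here reflects the actual C4 instruction encoding.

```python
def masked_vector_add(dest, src_a, src_b, mask):
    """Perform all VL additions but commit only those whose mask bit is set.

    Returns the updated destination plus the useful and discarded
    operation counts.
    """
    vl = len(mask)
    computed = [src_a[i] + src_b[i] for i in range(vl)]  # all VL operations run
    result = [computed[i] if mask[i] else dest[i] for i in range(vl)]
    useful = sum(1 for bit in mask if bit)
    return result, useful, vl - useful


dest = [0, 0, 0, 0, 0, 0]
a = [1, 2, 3, 4, 5, 6]
b = [10, 20, 30, 40, 50, 60]
mask = [True, False, True, True, False, False]

result, useful, discarded = masked_vector_add(dest, a, b, mask)
print(result, useful, discarded)
```

The discarded count is what the text treats as misspeculated work: operations that occupied the functional unit but whose results never reached the destination register.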

Table 3. Instructions and operations executed under vector mask

           Instructions under vector mask        Operations under vector mask
Program    Total (x 10^6)   % of total instr.    Total (x 10^6)   % of total ops
Swm256          0.01             0.015                0.13            0.016
Hydro2d         5.75             7.75               582.91           15.65
Nasa7           0.07             0.036                8.02            0.20
Su2cor          1.06             0.63               130.75            3.95
Tomcatv         0.00             0.00                 0.00            0.00
Wave5           5.17             0.84                80.00            3.64

3  Scalar  and  Vector  ISA’s  Comparison 


In this section we present a comparison between superscalar and vector processors
at the instruction set architecture level. We will look at three different issues
that are determined by the instruction set being used and by the compiler: number
of instructions executed, number of operations executed and memory traffic
generated. The distinction between instructions and operations is necessary
because, in the vector architecture, a vector instruction executes several
operations (between 1 and 128 in our case).


3.1  Instructions  Executed 

As already mentioned, vector instructions contain a high semantic content in terms
of operations specified. The result is that, to perform a given task, a vector
program executes many fewer instructions than a scalar program, since the scalar
program has to specify more address calculations, loop counter increments and
branch computations that are typically implicit in vector instructions. The net
effect of vector instructions is that, in order to specify all the computations
required for a certain program, far fewer instructions are needed. Fig. 3(a)
presents the total number of instructions executed in the Mips R10000 (using the
Mips IV Instruction Set [12]) and the Convex C4 machines for the six benchmark
programs. In the Mips R10000 case, we use the values of graduated instructions
gathered using the hardware performance counters. In the Convex C4 case we use the
traces provided by Dixie [12]. As can be seen, the differences are huge. Obviously,
as the vectorization degree decreases, this gap diminishes. Although several
compiler optimizations (loop unrolling, for example) can be used to lower the
overhead of typical loop control instructions in superscalar code, vector
instructions are inherently more expressive. Having vector instructions allows a
loop to do a task in fewer iterations. This implies fewer computations for address
calculations and loop control, as well as fewer instructions dispatched to execute
the loop body itself. As a direct consequence of executing fewer instructions, the
instruction fetch bandwidth required, the pressure on the fetch engine and the
negative impact of branches are all reduced in comparison to a superscalar
processor. Also, a relatively simple control unit is enough to dispatch a large
number of operations in a single go, whereas the superscalar processor devotes an
ever increasing part of its area to managing out-of-order execution and multiple
issue. This simple control, in turn, can potentially yield a faster clocking of the
whole datapath. It is interesting to note that the ratio of number of instructions
can be larger than 128. Consider, for example, Swm256. In vector mode, it requires
102.28 million instructions while in superscalar mode it requires 11466 million
instructions. If, on average, each vector instruction performs 93 iterations then
all these vector instructions would be roughly equivalent to 102.28*93 = 9512
million superscalar instructions. The difference between 9512 and 11466 is the
extra overhead that the superscalar machine has to pay due to the larger number of
loop iterations it performs.
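
The Swm256 arithmetic above can be replayed directly. The short Python sketch below uses only figures quoted in the text and in Table 1 (27.46 + 74.82 = 102.28 million instructions on the Convex C4, 11466 million on the Mips R10000, average VL of 93); the variable names are illustrative.

```python
scalar_instructions = 27.46        # millions, Swm256 on the Convex C4 (Table 1)
vector_instructions = 74.82        # millions, Swm256 on the Convex C4 (Table 1)
superscalar_instructions = 11466   # millions, Swm256 on the Mips R10000
average_vl = 93                    # average vector length from Table 1

c4_instructions = scalar_instructions + vector_instructions
# The text's estimate scales every C4 instruction by the average vector length.
equivalent = c4_instructions * average_vl
overhead = superscalar_instructions - equivalent
ratio = superscalar_instructions / c4_instructions

print(f"C4 instructions:           {c4_instructions:.2f} M")
print(f"equivalent scalar work:    {equivalent:.2f} M")
print(f"extra superscalar overhead: {overhead:.2f} M")
print(f"instruction ratio:         {ratio:.1f}")
```

With these inputs the instruction ratio for Swm256 comes out at roughly 112, which is above the program's average vector length of 93; the gap corresponds to the loop-control overhead discussed in the text.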


3.2  Operations  Executed 

Although the comparison in terms of instructions is important from the point of
view of the pressure on the fetch engine, a more accurate comparison between the
superscalar and vector models comes from looking at the total number of operations
performed. As already mentioned in the previous section, the reduction of overhead
due to the semantic content of vector instructions should translate into a smaller
number of operations executed in the vector model. Fig. 3(b) plots the total number
of operations executed on each platform for each program. These data have been
gathered from the internal performance counters of the Mips R10000 processor, and
from the traces obtained with Dixie. As expected, the total number of operations in
the superscalar platform is greater than in the vector machine, for all programs.
The ratio of superscalar operations to vector operations favours the vector model
by factors that range from 1.24 up to 1.88.


Fig. 3. Vector - Superscalar ISA comparison. (a) Instructions executed. (b)
Operations executed. Legend: Convex C4, Mips R10000


3.3  Memory  Traffic 

Another analysis that we have carried out is the study of memory traffic in both
vector and superscalar processors. Superscalar processors have a memory hierarchy
in which data is moved up and down in units of cache lines. Some of this data is
evicted from the cache before it is used, so there is an amount of traffic that is
not strictly useful. In vector processors, every data item that is brought from
main memory is used, so there is no useless traffic. Moreover, depending on the
data size of the program, superscalar processors behave differently. If the data
fits in L1, there will be almost no traffic between the L1 and L2 caches. However,
if the data doesn't fit in L1 but fits in L2, there will be a lot of traffic
between the L1 and L2 caches because of conflicts. If the data doesn't fit in the
L2 cache, traffic between the L2 cache and main memory will increase a lot. These
behaviours can be seen in Fig. 4.

Fig. 4. Vector - Superscalar memory traffic comparison. Legend: Convex C4,
R10000 (Reg-L1), R10000 (L1-L2), R10000 (L2-Mem)
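
One source of traffic that is "not strictly useful" is the line granularity itself: a cache fetches whole lines even when only part of each line is referenced. The Python sketch below illustrates just that effect for a strided access pattern. The 64-byte line, 8-byte element and stride values are hypothetical choices, not R10000 parameters, and the eviction-before-use effect mentioned in the text is not modelled.

```python
def line_traffic(num_elements, stride, element_bytes=8, line_bytes=64):
    """Bytes moved when each touched cache line is fetched whole, versus
    bytes the program actually uses (all sizes are hypothetical)."""
    touched_lines = {(i * stride * element_bytes) // line_bytes
                     for i in range(num_elements)}
    moved = len(touched_lines) * line_bytes
    used = num_elements * element_bytes
    return moved, used


for stride in (1, 2, 8):
    moved, used = line_traffic(1024, stride)
    print(f"stride {stride}: moved {moved} B, used {used} B, "
          f"useful fraction {used / moved:.2f}")
```

A memory system that transfers only the addressed elements, as the text describes for the vector processor, would move `used` bytes in every case.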


4  Speculative  Execution  in  Superscalar  and  Vector  Processors 

In this section we study speculative execution in superscalar and vector
processors. Each architecture is able to execute instructions speculatively,
although each in its own way. Superscalar processors speculatively execute
instructions based on predictions of conditional branches. Vector processors
execute instructions under vector masks, and only the results whose mask position
holds the correct value are actually stored. This section studies the effectiveness
of speculative execution in both architectures.


4.1  Speculation  in  Superscalar  Processors 

The increase in the aggressiveness of superscalar (SS) processors regarding issue
width and out-of-order execution has made branch prediction and speculative
execution essential techniques for taking advantage of processor capabilities. When
a branch is reached and the result of the condition evaluation is not known, the
final result of the branch is predicted, so that execution continues along the
speculated direction. When the actual result of the branch condition is obtained,
the executed instructions are validated if the prediction was correct, and rejected
if not.

The amount of misspeculated instructions in the SS processor is presented in
Fig. 5. These data have been gathered using the Mips R10000 internal performance
counters. This speculative work includes all types of instructions. As we can see
in Fig. 5(a), the misspeculated execution of instructions (relative to the total
number of issued instructions) for the six programs ranges from 14% to 25%.

Among the misspeculated work, load/store misspeculation is especially important
because it wastes non-blocking cache resources and bandwidth, and can pollute the
cache (and the memory hierarchy in general) by moving data between different levels
that will not be used in the future. Fig. 5(b) shows the load/store misspeculation
degree for the benchmarks with respect to the total number of load/store
instructions. In some of them, the misspeculation percentage is as large as 40%,
although the mean value is about 15%.


Fig.  5.  (a)  Misspeculative  execution  in  superscalar  processors,  (b)  Load  misspeculation  in 
superscalar  processors 


4.2  Speculation  in  Vector  Processors 

Vector processors are also able to execute instructions speculatively, but in a
different way from superscalar processors: speculation is based on execution under
vector mask. When an instruction is executed under vector mask, all the operations
are carried out, but only those having the correct value in the i-th position of
the vector mask are actually stored in the destination register. We have previously
presented the values of masked execution relative to the total number of
instructions and operations carried out by the programs. However, as masked
execution is only carried out in vector mode, a more precise measure of the use of
masked execution is presented in Table 4. The measures in Table 4 show that the
behaviour differs from one program to another. Hydro2d executes a considerable
amount of operations under mask (16%). Swm256 and Nasa7 make almost no use of
execution under mask and, finally, Su2cor and Wave5 execute 4.23% and 4.95% of
their operations under mask.

An interesting analysis, independent of how much masked execution is used, is the
effectiveness of masked execution: are all these instructions executed under mask
properly speculated or not? An operation is speculated "right" if, after the
operation has been carried out, the result is effectively stored in its
destination. All those operations that were carried out but not stored are
misspeculated work. Fig. 6(a) shows the distribution of right and wrong speculated
operations in the five programs (recall that Tomcatv does not execute instructions
speculatively). Three of the programs (Nasa7, Su2cor and Wave5) have good values of
right prediction: Nasa7 and Wave5 are above 63% of right speculation and Su2cor is
above 56%. The other two programs (Swm256 and Hydro2d) have low values of right
speculation, with Swm256 being the program with the worst behaviour (only 2.58% of
right speculation).

Table 4. Instructions and operations executed under vector mask

           Instructions under vector mask         Operations under vector mask
Program    Total (x 10^6)   % of vector instr.    Total (x 10^6)   % of vector ops
Swm256          0.01             0.002                 0.13            0.016
Hydro2d         5.75            16.25                582.91           16.00
Nasa7           0.07             0.12                  8.02            0.20
Su2cor          1.06             4.41                130.75            4.23
Wave5           5.17            14.40                 80.00            4.95

Another interesting consideration that we have studied regards the distribution of operations executed under mask among the different instruction types. This study has allowed us to establish the amount of instructions executed under mask for each type of instruction. We have considered six types of instructions: add-like, mul-like, div, diadic, load and store.


Fig. 6. (a) Distribution of Right - Wrong speculated operations. (b) Distribution of instructions executed under vector mask among the different instruction types

The first consideration comes from the fact that none of the programs execute load instructions under mask, which may be explained by the availability of gather instructions. Fig. 6(b) shows the breakdown of instructions executed under mask among the different instruction types. Division and add-like instructions are the most used instructions for execution under mask.

Finally, we have also studied the effectiveness of execution under mask among the different types of instructions. Results in Fig. 7 show that, in general, there is no clear correlation between the instruction type and the misspeculation rate. Division instructions are an exception. For divisions the misspeculation rates are higher than for the rest of the instructions in all cases. This result is not unreasonable since division instructions are typically executed in statements such as the following:

if A(i) <> 0 then B(i) = B(i)/A(i)

In  such  a  case,  misspeculation  is  determined  by  the  value  stored  in  A(i).  In  our  programs,  the 
A(i)  vector  is  sparsely  populated  and  causes  large  numbers  of  misspeculations. 
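
The effect can be illustrated with a minimal scalar sketch in Java. It is not the vector ISA studied here, and the class and method names are hypothetical. The sketch assumes that the division is carried out for every element and that the mask only decides whether the result is stored, so every element with A(i) = 0 counts as one misspeculated operation.

```java
/** Illustrative only: a scalar model of a division executed under mask. */
final class MaskedDivide {
    /**
     * Performs b[i] = b[i] / a[i] wherever a[i] != 0 and returns the number
     * of divisions that were computed but whose result was not stored.
     */
    static int maskedDivide(double[] a, double[] b) {
        int misspeculated = 0;
        for (int i = 0; i < a.length; i++) {
            boolean mask = a[i] != 0.0;   // mask bit for element i
            double result = b[i] / a[i];  // executed for every element
            if (mask) {
                b[i] = result;            // "right" speculation: result stored
            } else {
                misspeculated++;          // result discarded
            }
        }
        return misspeculated;
    }
}
```

With a sparsely populated A, most mask bits are zero and most divisions are wasted work, which matches the high misspeculation rate reported for divisions.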


Fig.  7.  Break-down  of  Right  -  Wrong  speculated  operations 


5  Conclusions 

We have outlined a comparison between superscalar and vector processors from several points of view. Vector processors have different possibilities that allow them to decrease the memory traffic and branch impact in a program's execution. Their SIMD model is especially interesting because the initial latency of the operations is amortized across the VL operations that each instruction executes.

We have studied the behaviour at the ISA level of the superscalar and vector processors. We looked at the total number of instructions executed, the number of operations executed and the memory traffic. The vector processor executes far fewer instructions than the superscalar machine due to the higher semantic content of its instructions. This translates into a lower pressure on the fetch engine and the branch unit. Moreover, the vector model executes fewer operations than the superscalar machine. The analysis of memory traffic reveals that, in general, and ignoring spill code effects, the vector machine performs fewer data movements than the superscalar machine.

We have also studied the speculative execution behaviour of superscalar and vector processors. Superscalar processors make extensive use of speculative execution and their misspeculation rates are significant. Vector processors, on the other hand, execute speculatively by using the vector mask. They make less use of execution under mask, and their misspeculation rates are also significant, although many of the misspeculations are caused by prediction in div instructions.

Abstract. This paper presents JWarp, a Java library that implements an optimistic model of discrete-event parallel simulation: the Time-Warp model. Java fits well in the field of simulation and offers some important advantages over other languages: modularity, flexibility, robustness, support for multithreading and exception handling. The paper presents the main features of the library, the programming interface and some of its implementation details. JWarp is one of the first libraries to implement Time-Warp in Java.


1.  Introduction 

Several areas, such as engineering, computer science, economics and the military, are particularly interested in using simulation to study the behaviour of complex models. The execution of some of those simulation models can be a very time-consuming task. For statistical reasons it might be necessary to simulate a model for quite a long time, or to perform the same simulation several times with different parameter values.

One way to reduce the execution time of long-running simulations is to use multiple processors operating in parallel [Fujimoto90]. A typical simulation model involves several components or entities. By exploiting this inherent parallelism of the model it is possible to speed up the simulation by distributing these components across several processors.

Every simulation model is a specification of the corresponding physical model and is composed of a set of states and events. In a discrete-event simulation the state of the system only changes at discrete points in simulated time.

A natural decomposition strategy can result in an object-oriented system design, where an object corresponds to some component of the real system and is represented by a computational task that is assigned to a processor for execution. In this way, a logical process (LP) simulates every component of the model. A discrete-event simulation requires the existence of multiple LP entities, a time-ordered event list holding timestamped events to be processed in the future, a global discrete clock that indicates the current simulation time and a set of state variables that define the state of the simulation.
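
The ingredients listed above (event list, clock, state variables) can be sketched as a sequential simulator. This is only an illustration with hypothetical names, not JWarp code: it uses a single state variable, and each event simply adds a value to it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Toy sequential discrete-event simulator; all names are illustrative. */
final class SequentialSim {
    /** A timestamped event whose effect is to add delta to the state. */
    static final class Event implements Comparable<Event> {
        final long time;
        final int delta;
        Event(long time, int delta) { this.time = time; this.delta = delta; }
        @Override public int compareTo(Event other) { return Long.compare(time, other.time); }
    }

    final PriorityQueue<Event> eventList = new PriorityQueue<>(); // time-ordered event list
    final List<Long> trace = new ArrayList<>();                    // times at which the state changed
    long clock = 0;                                                // global simulation clock
    int state = 0;                                                 // single state variable

    void schedule(long time, int delta) { eventList.add(new Event(time, delta)); }

    /** Processes events in timestamp order until the list is empty. */
    void run() {
        while (!eventList.isEmpty()) {
            Event e = eventList.poll();   // earliest pending event
            clock = e.time;               // the clock jumps to the event time
            state += e.delta;             // state changes only at event times
            trace.add(clock);
        }
    }
}
```

Events may be scheduled in any order; the priority queue guarantees that they are consumed in increasing timestamp order, which is the property that a parallel simulation must preserve.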

The simplest way to manage the event list would be a centralized strategy: the list of events is managed by a single process (master), and a pool of slave processes running on the parallel system executes those events concurrently. However, a centralized queue of events would be a bottleneck for the simulation, clearly reducing the potential for parallelism.

The most permissive way of conducting parallel simulations is to eliminate the globally shared event list and use a completely distributed list of events. Each LP is assigned to a processor and maintains its own local simulation clock, a local event list and a set of state variables. Events are modeled as timestamped messages, which are exchanged between the LPs that represent the physical objects of the application.

However, schemes that follow a distributed strategy require synchronization protocols to make sure that events are processed in a consistent order by all the LP entities. These synchronization protocols may increase the cost of communication between processors. Nevertheless, they have received considerable attention from the parallel simulation research community [Lin95].

In order to understand the main issue behind the use of distributed event lists, let us take a look at Figure 1. It represents the temporal execution of two logical processes (LP1 and LP2). LP1 is processing event alpha while LP2 is processing event beta. The execution of event alpha generates a new event (gamma) that is sent to LP2. This gamma event has a lower timestamp than event beta, and thus should have been consumed before it. Due to the asynchrony of the LP entities it was not possible to ensure a consistent order in the processing of events, which results in a causality error [Fujimoto90].


The  synchronization  protocols  have  been  broadly  classified  as  Conservative  or 
Optimistic  [Reynolds88].  Both  schemes  are  based  on  the  sending  of  messages 
carrying  some  causality  information. 

The Conservative approach [Chandy79] strictly avoids the possibility of any causality error ever occurring. This is achieved by blocking each process until the system is sure that no other LP will schedule an event with a timestamp smaller than the one at the head of the local event list. This method introduces some blocking in the execution of processes and restricts the potential for parallelism. Besides, it is prone to deadlock and thus requires a deadlock detection and recovery scheme.

The Optimistic approach tries to exploit all the potential parallelism available in the simulations. The Time Warp mechanism is a well-known optimistic approach based upon the Virtual Time paradigm [Jefferson85]. It relies upon a scheme for causality error detection and a recovery scheme based on a rollback technique. An optimistic LP proceeds with the simulation and advances its local virtual time for as long as no causality error occurs.

Fig. 1. The causality problem

If an event is scheduled in some LP with a timestamp in the local past relative to the local virtual clock, i.e. out of chronological order (a straggler message), then the LP entity is forced to roll back to the most recently saved state in the simulation history that is consistent with the arrival of that event message. It then restarts the simulation at that point, thereby correcting the causality error.
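
The detection and recovery steps just described can be sketched as follows. The class is hypothetical and deliberately simplified: it assumes one checkpoint before every event and a single integer as state, whereas a real Time Warp kernel would save state less often and would also re-process the events that follow the straggler.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy optimistic LP; names and the checkpoint-per-event policy are assumptions. */
final class OptimisticLp {
    /** Snapshot of the LP taken just before an event is processed. */
    private static final class Checkpoint {
        final long lvt;
        final int state;
        Checkpoint(long lvt, int state) { this.lvt = lvt; this.state = state; }
    }

    long lvt = 0;     // local virtual time
    int state = 0;    // single state variable
    private final Deque<Checkpoint> saved = new ArrayDeque<>();

    /** A message is a straggler if its timestamp lies in the local past. */
    boolean isStraggler(long timestamp) { return timestamp < lvt; }

    /** Saves the current state, then processes the event optimistically. */
    void process(long timestamp, int delta) {
        saved.push(new Checkpoint(lvt, state));
        lvt = timestamp;
        state += delta;
    }

    /** Restores the most recent checkpoint taken at or before the given time. */
    void rollbackTo(long timestamp) {
        while (!saved.isEmpty()) {
            Checkpoint c = saved.pop();
            if (c.lvt <= timestamp) {
                lvt = c.lvt;
                state = c.state;
                return;
            }
        }
    }
}
```

After `rollbackTo` returns, the straggler is no longer in the local past, so it can be processed in order, followed by the events that had been processed too early.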

In order to allow this rollback operation every LP entity is forced to save its simulation state from time to time. All the messages that were sent after the restored instant of time must be undone. This is achieved by sending anti-messages that annihilate the original messages. If the original messages were already consumed by the destination processes, those processes will be forced to roll back to a previously saved state as well. It was proved in [Jefferson85] that the protocol never rolls all the way back to the beginning of the simulation and always ensures some forward progress for the computation.

Anti-messages (also called negative messages) are exact copies of normal (positive) messages with a single difference: the sign field has a different value. When a process sends an anti-message it passes part of its rollback responsibility to the other process. The receiver may or may not roll back depending on its internal state: if the message corresponding to the anti-message was already consumed then it must roll back.
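
A message and its anti-message can be modelled as two objects that agree on every field except the sign. The sketch below is hypothetical: the field names are assumptions chosen for illustration and are not the JWarp `Message` class.

```java
/** Illustrative signed message; the field set is an assumption. */
final class SignedMessage {
    final String sender;
    final String receiver;
    final long sendTime;
    final long receiveTime;
    final boolean positive;   // the sign field

    SignedMessage(String sender, String receiver, long sendTime, long receiveTime, boolean positive) {
        this.sender = sender;
        this.receiver = receiver;
        this.sendTime = sendTime;
        this.receiveTime = receiveTime;
        this.positive = positive;
    }

    /** Returns an exact copy of this message with the opposite sign. */
    SignedMessage antiMessage() {
        return new SignedMessage(sender, receiver, sendTime, receiveTime, !positive);
    }

    /** True if the two messages differ only in their sign. */
    boolean annihilates(SignedMessage other) {
        return positive != other.positive
                && sender.equals(other.sender)
                && receiver.equals(other.receiver)
                && sendTime == other.sendTime
                && receiveTime == other.receiveTime;
    }
}
```

When a queue holds both members of such a pair, both can be removed; whether the receiver must also roll back depends on whether the positive member had already been consumed.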

The major drawback of the Time Warp approach is the need to save each process state periodically [Jefferson87]. To free up some of the used memory the simulation system calculates a time limit, called Global Virtual Time (GVT) [Belenot90], before which no process will ever be required to roll back, so that the system can garbage-collect everything older than it. Alternative solutions are also required to optimize the rollback operation [Gafni88] and to achieve load-balanced simulations [Das94].
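
A common way to define GVT, used here only as an illustrative assumption and not as the algorithm JWarp implements, is the minimum over the local virtual times of all LPs and the timestamps of all messages that are still in transit (sent but not yet acknowledged). A minimal sketch with hypothetical names:

```java
/** Illustrative GVT estimate; the inputs are assumed to be collected consistently. */
final class Gvt {
    /**
     * Returns the minimum over all local virtual times and over the receive
     * timestamps of messages that have been sent but not yet acknowledged.
     */
    static long compute(long[] localVirtualTimes, long[] unacknowledgedTimestamps) {
        long gvt = Long.MAX_VALUE;
        for (long t : localVirtualTimes) {
            gvt = Math.min(gvt, t);
        }
        for (long t : unacknowledgedTimestamps) {
            gvt = Math.min(gvt, t);
        }
        return gvt;
    }
}
```

Messages in transit must be included because an unacknowledged message with a small timestamp can still cause a rollback in its receiver, even if every LP has already advanced past that time.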

Time Warp is a relatively complex simulation protocol but it has proved to be a very effective technique for running complex asynchronous simulations [Wieland89][Presley89]. We foresee that an implementation in Java could make Time Warp more widely used by the research community as well as for educational purposes.


2.  The  Importance  of  Java 


In the past few years, Java has received a great deal of attention from several fields of computing, including network and distributed programming.

A wide range of computing platforms now supports the Java Virtual Machine (JVM) [Oasis]. Since Java programs are entirely portable across the systems that have a JVM, we will be able to execute parallel simulations in heterogeneous systems, comprising networks of personal computers running a Microsoft Windows operating system or clusters of workstations running some flavor of Unix. All this will be possible with a simulation tool like JWarp. Programmers are not required to change any line of code of their simulations, since Java provides the necessary support to deal with the heterogeneity.

The main handicap of Java is still its poor performance. However, recent studies have shown that, with the use of JIT compilation, Java has raised its performance to the level of C++ [Mangione98]. With the foreseen improvement in the available JVMs, Java will close the performance gap even more in the near future.


3.  JWarp  Internal  Architecture 


Figure 2 represents the JWarp internal architecture. In this figure, the ovals represent threads, the rectangles represent data buffers and full lines represent data transfers. In this first view, only positive messages (thin lines) and state saving and restoring (thick lines) are represented. All the other kinds of message flows are shown later.

Events arrive at every LP by being first received by cs2ib, placed in IB, received by ib2iq and placed in IQ. Outgoing events are placed in OQ by the LP, received by oq2ob, placed in OB, received by ob2cs and sent into the network.


Fig.  2.  JWarp  architecture 


LP state variables (defined by the programmer) are saved from time to time in the State Stack (SS).

The threads cs2ib, oq2ob and ob2cs simply run an infinite loop, fetching data from one side and placing it on the other. Thread ob2cs analyses one field of the messages to know where to send them over the network.
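
Such a forwarding thread can be sketched with the standard `java.util.concurrent.BlockingQueue` interface. The `Pump` class is a hypothetical illustration; the source does not say how JWarp implements its buffers.

```java
import java.util.concurrent.BlockingQueue;

/** Moves items from one buffer to another until the thread is interrupted. */
final class Pump<T> implements Runnable {
    private final BlockingQueue<T> source;
    private final BlockingQueue<T> destination;

    Pump(BlockingQueue<T> source, BlockingQueue<T> destination) {
        this.source = source;
        this.destination = destination;
    }

    @Override
    public void run() {
        try {
            while (true) {
                // take() blocks while the source is empty, so the loop does not spin.
                destination.put(source.take());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();   // leave the loop when interrupted
        }
    }
}
```

A thread such as oq2ob would then correspond to `new Thread(new Pump<>(oq, ob))`, with `oq` and `ob` being two queues, for example `LinkedBlockingQueue` instances.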

Thread ib2iq detects out-of-order messages and causality errors. It commands state restoring and anti-message sending, and processes GVT calculation requests.

If there were no straggler messages the JWarp internal behaviour would be the following:

1. The message arrives at cs2ib from the network through TCP/IP.

2. The message is placed in IB in order of arrival.

3. It is fetched by ib2iq.

4. A corresponding acknowledge message is put in OB by ib2iq.

5. The acknowledge message is sent by ob2cs.

6. ib2iq puts the received message in IQ, ordered by simulation time.

7. Depending on the checkpoint frequency, the LP's state is saved in SS.

8. Just after the state saving the message finally arrives at the LP. LVT is updated to a new value that corresponds to the incoming message processing time.

9. The LP consumes the message and responds by sending zero, one or more messages, to one or more recipients, that are placed in the Output Queue in order of arrival.

10. The messages are then fetched from OQ and placed in OB by oq2ob.

11. They are finally sent over the network if they are remote events or placed in IQ if they are local events.


Fig.  3.  Buffer  behaviour 


3.1.  Buffers 

In JWarp, when a buffer is asked to retrieve the next event it can do one of two things: i) retrieve, return and delete the message, or ii) just retrieve and return it. Buffers IB and OB delete retrieved messages while IQ and OQ do not. Events are kept in IQ because, when there is a rollback operation, those events must be consumed again. Likewise, the events that were sent must be kept in OQ because anti-messages may have to be sent for them. Fetching an event from IQ or OQ only means retrieving a copy of it and moving the LVT pointer forward.

Although the pointers are called LVT and GVT they do not store LVT and GVT time values. They are just references (positions) in the array buffers. Buffers IB and OB do not need to keep any of their messages. All the information needed for a rollback is stored in IQ, OQ and SS between the GVT and LVT pointers.

IQ  -  Input  Queue 

In IQ, the events after the GVT pointer are the ones whose simulation time is greater than the GVT time. Events after the LVT pointer are the ones whose simulation time is greater than the LVT time. Thus events after the LVT pointer have not been processed yet, and the ones between the GVT and LVT pointers have been processed but cannot be discarded because they might be needed in a rollback situation.

Events in IQ are placed in increasing simulation time order. The fetched event is always the one at position (LVT pointer) + 1.
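
The behaviour of IQ described above can be sketched with an array-backed list and an index that plays the role of the LVT pointer. The class is a hypothetical illustration that stores only timestamps; it also shows how inserting a straggler moves the pointer back so that the affected events become pending again.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative input queue that keeps processed events for later rollbacks. */
final class InputQueue {
    private final List<Long> times = new ArrayList<>(); // event timestamps, ascending
    private int lvtPointer = -1;                         // index of the last event handed to the LP

    /** Inserts in increasing simulation-time order; returns true if a rollback is needed. */
    boolean insert(long time) {
        int i = times.size();
        while (i > 0 && times.get(i - 1) > time) {
            i--;
        }
        times.add(i, time);
        if (i <= lvtPointer) {     // the event landed among already processed events
            lvtPointer = i - 1;    // the straggler and all later events become pending
            return true;
        }
        return false;
    }

    /** Returns the event at (LVT pointer) + 1 without deleting it, or -1 if none is pending. */
    long fetch() {
        if (lvtPointer + 1 >= times.size()) {
            return -1;
        }
        lvtPointer++;
        return times.get(lvtPointer);
    }

    int size() { return times.size(); }
}
```

Because `fetch` never removes anything, the events that were consumed too early are simply handed out again after the pointer has been moved back.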


OQ  -  Output  Queue 

In OQ, the events after the LVT pointer have not been sent yet, and the ones between the GVT and LVT pointers have been sent but cannot be deleted because they might be needed for anti-messaging. Note that IQ's LVT pointer is directly related to the LVT value: it defines a frontier splitting events with simulation time smaller than LVT from those with simulation time greater than LVT. In OQ, however, there is no such relationship: the LVT pointer is just a frontier splitting sent and unsent events. This means that events in OQ are sent as soon as possible, even if they have simulation times much greater than LVT. Events in OQ are placed and retrieved in FIFO order.

SS - State Stack

States are saved from time to time and placed in SS in FIFO order. There are no state records above the LVT pointer or below the GVT pointer, as can be seen in Figure 3.

IB - Input Buffer & OB - Output Buffer

Events are inserted and retrieved in FIFO order. When an event is retrieved from IB or OB it is removed from the buffer.

3.2.  Threads 

After the initial synchronization phase there will be the following threads: cs2ib, ib2iq, LP, oq2ob, ob2cs and GVTmaster.

cs2ib - From Communication System to Input Buffer

This thread only listens for incoming messages. It receives every kind of message (normal events, acknowledge messages, GVT start requests and GVT broadcasts) and treats them all with the same procedure: place the incoming message in IB.

oq2ob - From Output Queue to Output Buffer

Runs an infinite loop fetching from OQ and putting in OB. This operation automatically updates OQ's LVT pointer.

ob2cs - From Output Buffer to Communication System

This thread fetches messages from OB. If the message is a normal message, an anti-message, an acknowledge message or a GVT report message, it peeks into its receiver ID field and sends it there. If the message is a GVT start or GVT broadcast message, it sends the message to every possible LP over the network.

ib2iq  -  From  Input  Buffer  to  Input  Queue 

Upon  receiving  a  message  it  acts  as  follows: 

A. If it receives a normal message, the message is placed in IQ in simulation order, ready to be processed by the LP. A corresponding acknowledge message is immediately sent to OB.

A.1. If, when trying to place the message, it realizes there was a causality error, it initiates the rollback operation. The message is still placed in IQ in simulation order regardless of the rollback.

A.2. If there was a negative counterpart in the queue, then both messages are annihilated and no rollback happens, even if the negative message had a past simulation time.

B. If it is an anti-message:

B.1. With a corresponding positive (normal) message in IQ, it annihilates both.

B.1.1. If the positive message was already consumed, it starts the rollback operation.

B.2. Without a corresponding positive counterpart, it is just placed in IQ.

C. If it is an acknowledge message, it searches for the corresponding unacknowledged message in OQ and sets its status to acknowledged.

D. If it is a GVT start message, it starts the GVT calculation algorithm, which finishes by sending a GVT report message to OQ and from there to the process that initiated the GVT calculation.

E. If it is a GVT broadcast message, the GVT is updated accordingly and the garbage collection takes place.
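
Rules A and B above can be sketched as a single decision procedure. The code is a hypothetical simplification: messages are matched by an assumed `id` field, acknowledgements and the GVT cases (C to E) are omitted, and the rollback itself is only reported to the caller.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative handling of normal messages and anti-messages by ib2iq. */
final class Ib2IqSketch {
    enum Outcome { QUEUED, ANNIHILATED, ROLLBACK }

    static final class Msg {
        final long id;            // assumed identifier shared by a message and its anti-message
        final long time;          // simulation time
        final boolean positive;   // sign
        boolean consumed = false; // set once the LP has processed the message
        Msg(long id, long time, boolean positive) {
            this.id = id;
            this.time = time;
            this.positive = positive;
        }
    }

    final List<Msg> iq = new ArrayList<>();
    long lvt = 0;

    Outcome receive(Msg m) {
        // Rules A.2 and B.1: look for the counterpart with the opposite sign.
        for (int i = 0; i < iq.size(); i++) {
            Msg other = iq.get(i);
            if (other.id == m.id && other.positive != m.positive) {
                iq.remove(i);   // annihilate both messages
                // Rule B.1.1: a consumed positive message forces a rollback.
                return other.consumed ? Outcome.ROLLBACK : Outcome.ANNIHILATED;
            }
        }
        // Rules A and B.2: keep the message in simulation-time order.
        int pos = iq.size();
        while (pos > 0 && iq.get(pos - 1).time > m.time) {
            pos--;
        }
        iq.add(pos, m);
        // Rule A.1: a positive message in the local past is a causality error.
        return (m.positive && m.time < lvt) ? Outcome.ROLLBACK : Outcome.QUEUED;
    }
}
```

A stored anti-message is never marked as consumed, so the arrival of its positive counterpart always ends in plain annihilation, as rule A.2 requires.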


LP  -  Logical  Process 


The programmer's thread is completely unaware of the differences between negative and positive messages, and of GVT start, GVT report, GVT broadcast and acknowledge messages. All messages received by the LP are positive and are therefore treated in the same way. They are fetched from IQ and the rest is up to the programmer. When processing an event, zero, one or more events may be produced and then placed in OQ.

Fig.  4.  Thread  layers 


GVT  Master 

This thread only exists in one process. From time to time it wakes up and initiates the GVT calculation mechanism by sending a GVT start message into buffer OB. On the other side of the buffer, thread ob2cs fetches the message, sees that it is a message to broadcast and does so.

Layer  Relations  Between  Threads 

All JWarp threads and their communication channels (buffers) are represented in Figure 4.

In the Time Warp model every message has at least four fields: Sender ID, Receiver ID, Sender LVT and Receiver LVT. Receiver LVT is also called simulation time, since the message will be simulated at that particular time.


3.3.  Types  of  Messages 


Fig. 5. Message flows - represents every kind of possible message and its consequences. Messages are represented with full lines and actions with dotted lines. Message and action arrows of the same style are cause and consequence. Legend: the messages are negative events, GVT start and report, GVT broadcast, and acknowledgements; the actions are rollback (due to an out-of-order or negative event), GVT internal calculation phase (due to a GVT start message), garbage collection (due to a GVT broadcast), and setting the message status to acknowledged (due to an acknowledge message).


Positive  Messages 

As can be seen, only positive messages arrive at the LP. Just before a message arrives, the LP's state is saved in SS, as can be seen in Figure 5 by the dotted State Saving line. After arriving at the LP, the message is processed and possibly some more messages are produced and sent to the network. However, if a positive message is timestamped in the past, a rollback will happen. In a rollback operation, the state variables are restored from SS, the LVT pointer in IQ is adjusted to this state and the LVT pointer in OQ is also adjusted. In OQ, the messages that were to the left of the LVT pointer and are now to its right must be unsent. For every message in this condition, a corresponding anti-message is created and sent, while the original message is deleted. All messages that were already to the right of the LVT pointer before the adjustment are just deleted.
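
The treatment of OQ during a rollback can be sketched as follows. The helper is hypothetical and works on timestamps only. It assumes that OQ holds the sending times of the outgoing events in FIFO order and that every event produced after the restored local virtual time is invalid.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative rollback of the output queue; not the JWarp implementation. */
final class OutputQueueRollback {
    /**
     * sendTimes:    sending times of the events in OQ, in FIFO order.
     * sentCount:    number of events already sent (those left of the LVT pointer).
     * restoredTime: local virtual time of the restored state.
     * Truncates the queue to the events that remain valid and returns the
     * sending times of the events for which an anti-message must be sent.
     */
    static List<Long> rollback(List<Long> sendTimes, int sentCount, long restoredTime) {
        int keep = 0;
        while (keep < sendTimes.size() && sendTimes.get(keep) <= restoredTime) {
            keep++;
        }
        List<Long> antiMessages = new ArrayList<>();
        for (int i = keep; i < sendTimes.size(); i++) {
            if (i < sentCount) {
                antiMessages.add(sendTimes.get(i));   // already sent: must be cancelled
            }
            // Events at positions >= sentCount were never sent and are just dropped.
        }
        sendTimes.subList(keep, sendTimes.size()).clear();
        return antiMessages;
    }
}
```

After the call, the new LVT pointer of OQ would be the smaller of `sentCount` and the remaining queue length.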

Negative  Messages 

If the incoming message is negative it will never get to the LP. Two things may happen: it causes a rollback or it does not. Other Time Warp implementations allow a negative message to arrive before its positive counterpart if the underlying communication system delivers messages out of order. JWarp uses TCP sockets, so this is guaranteed never to happen. If it were possible for a negative message to arrive before the positive one, it would be enough to place it in IQ and not allow the LP to fetch it. When the positive message arrived, both would be annihilated in the buffers.

Acknowledge  Messages 

When a positive message arrives at ib2iq an acknowledge message is produced and placed in OB. When an acknowledge message arrives, ib2iq looks for its corresponding message in OQ and changes its status to "acknowledged".

GVT  Start  and  GVT  Report  Messages 

When a GVT start message arrives, OQ is consulted (GVT Internal Calculation Phase line) in order to obtain the proper values to respond with a GVT report message. That message is then sent back to the master.

GVT  Broadcast 

Finally,  when  a  GVT  broadcast  message  arrives  with  a  new  GVT  an  operation  of 
garbage  collection  is  started  which  involves  removing  some  data  from  IQ,  OQ  and  ss. 
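
For the state stack, garbage collection can be sketched as below. The helper is hypothetical, and it makes one assumption that the text does not state: the newest state saved at or before GVT is kept, so that a rollback to a time just after GVT can still find a state to restore.

```java
import java.util.List;

/** Illustrative garbage collection of saved states; names are assumptions. */
final class FossilCollector {
    /**
     * savedStateTimes holds the times of the saved states in ascending order.
     * Removes every state that is older than the newest state saved at or
     * before the given GVT.
     */
    static void collect(List<Long> savedStateTimes, long gvt) {
        // While the second-oldest state is still at or before GVT,
        // the oldest one can never be needed again.
        while (savedStateTimes.size() > 1 && savedStateTimes.get(1) <= gvt) {
            savedStateTimes.remove(0);
        }
    }
}
```

The same idea applies to IQ and OQ, where the events to the left of the GVT pointer can be discarded.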


4.  JWarp  Interface 

Like many simulation languages and environments, the JWarp library offers an event list and functions to fetch and schedule events. Applications built with JWarp should typically run in a loop, fetching one event at a time from the event list and processing that event. The event processing operations may produce zero, one or more events to be handled either by the local processor or by a remote one.

To allow rollback operations, the state variables need to be saved periodically. JWarp offers special classes where the programmer defines which variables (or objects) are part of the application state and, therefore, which variables have their values restored after a rollback.

At the programming stage, the developer is asked to define the event types, that is, the messages to be exchanged between processors at run-time. The programmer must also define which machines and ports will be used in the distributed simulation. The pool of processors is therefore static: removing, adding or changing any of these entries implies a new compilation of the package.

The interface used by the application consists of only a few functions to retrieve events and to schedule events. Network communications, the location of other processes, and the operations of rollback, state saving and state restoring are completely invisible to the application.


4.1.  JWarp  Programming  Example 

Let us see, through a small example of a Ping-Pong application, how to program a JWarp simulation. This example is quite simple: one process sends an event message and the other replies with another event. Figure 6 presents the main Java file (PingPong.java) that specifies the LP entities and indicates the mapping of events to the corresponding LPs.

package pingpong;
import jwarp.*;

class PingPong {
    static JwarpManager sim = new JwarpManager();

    public static void main(String args[]) {
        // Declare the LP entities.
        Ping pPing = new Ping("I process ping events");
        Pong pPong = new Pong("I process pong events");
        // Map each event type to the LP that handles it.
        sim.mapsEvent2LP("ping", pPing);
        sim.mapsEvent2LP("pong", pPong);
        // Start the simulation.
        sim.JWInit(args);
    }
}

Fig.  6.  The  main  class  that  starts  the  whole  simulation  (PingPong.java) 

The required steps are:

1. First, create a class with a public static main method. In this example, the file is PingPong.java.

2. Define a JwarpManager object that will be responsible for the control of the simulation (line 5).

3. Declare our LP entities: Ping and Pong (lines 9 and 10).

4. Declare which events are handled by the Logical Processes by using the method mapsEvent2LP (lines 12 and 13).

5. Start everything with JWInit(args) (line 15).

The main class and a configuration class are used by the JWarp package to determine which processes should run on which processors, which LPs should be executed and what the mapping of events to LP entities is.

Figure 7 shows the code of one of the LP entities (Ping). The other one (Pong) is not shown since it is quite similar. There are mainly two things that a programmer is required to do:

1. Define the LP entities that will be part of the simulation. In this particular case they are defined in Ping.java and Pong.java. These classes are extensions of a JWarp abstract class: Jwarp_LP. Since this class implements Runnable the programmer must define its code inside the run method. These classes are the ones that define the model to be simulated;

2. Receive and send messages with the getEvent and putEvent methods.

package pingpong;
import jwarp.*;

public class Ping extends Jwarp_LP {
    public void run() {
        ppMessage pingOut;

        // Schedule an event from "ping" to "pong":
        // sending time 2, receiving time 5.
        pingOut = new ppMessage(2, 5, "ping", "pong", "Hi from Ping!");
        putEvent(pingOut);
        System.out.println("Ping sent message: " + pingOut);

        // Fetch the reply event.
        ppMessage pongIn = (ppMessage) getEvent();
        System.out.println("Pong received message: " + pongIn);
    }

    public Ping(String name) { super(name); }
}

Fig. 7. The code of one LP entity that sends a ping event and receives a pong (Ping.java)

Finally, Figure 8 presents the definition of an event message. The programmer is
basically required to:

1. Define the message types necessary for the simulation. In this case we define a
single one, which must extend the class Message, a class belonging to the JWarp
package;

2. Optionally define the toString method to print the message contents.

package pingpong;
import jwarp.*;

public class ppMessage extends Message {

    String sentence;

    public ppMessage(long sendingTime, long receivingTime,
                     String sender, String receiver,
                     String sentence) {
        super(sendingTime, receivingTime, sender, receiver);
        this.sentence = sentence;
    }

    public String toString() {
        return ("ppMessage " + this.getSendingTime() +
                " " + this.getReceivingTime() +
                " From: " + this.getSender() +
                " To: " + this.getReceiver());
    }
}

Fig. 8. Definition of an event message (ppMessage.java)


5.  Related  Work 

Several works on the Time Warp model have been presented in the literature
[Jefferson85][Fujimoto90][Lin95][Ferscha95]. It was first implemented as an
operating system, TWOS, at the Jet Propulsion Laboratory [Jefferson87]. Later on,
it was ported to several other systems [Fujimoto89][Turner94][Belenot92].

Several  parallel  simulation  languages  have  also  appeared  in  the  last  decade:  OLPS 
[Abrams88],  Maisie  [Bagrodia90],  ModSim  [West88],  SCE  [Gill89],  Sim++ 
[Baezer94]  and  YADDES  [Preiss89]. 

Other researchers have followed a different approach and chosen to implement the
parallel simulation system as a run-time library written in C++; examples include
WARPED [Martin94], SPEEDES [Steinman91] and HASE++ [Howell97].

Until recently, there were only two simulation libraries implemented in Java:
SimJava [SimJava] and SimKit [SimKit]. However, these libraries only support
sequential simulations. This year, parallel discrete-event simulation Java libraries
appeared: JTED [Cowie98], following the conservative approach, and Formax
[Halderen98], following a web-based optimistic approach.


6. Conclusions

This  paper  reports  an  implementation  of  the  Time  Warp  model  in  Java.  The  library 
implements  all  the  internal  synchronization  mechanisms  included  in  that  model  and 
provides  a  very  easy-to-use  programming  interface. 

With JWarp it is possible to execute parallel applications on clusters of
workstations and personal computers that support a Java Virtual Machine.
Java assures the portability of the programs, solves the problems of heterogeneity and
provides a quite flexible programming environment.

JWarp can be used to execute long-running complex simulation models. With the
appropriate visualization tools it can also be adopted in the classroom for the
teaching of parallel simulation techniques and concurrent programming.


Abstract. The PHAROS project, funded by the European Union's ESPRIT program
for research and development in information technology, aimed to assess
High Performance Fortran (HPF) as a paradigm for porting large FORTRAN 77
scientific applications to distributed memory architectures, in comparison to
message-passing programming. The AEROLOG computational fluid dynamics
software developed by MATRA was one of the applications ported to HPF.
It is devoted to the study of compressible fluid flows around
complex geometries. This paper describes the port of the AEROLOG code to
HPF based on the decomposition into subdomains. It outlines the parallelization
strategy, the changes to the data structures and the tuning of the boundary
conditions for the subdomains. Performance results for industrial test cases with
different HPF compilers are given and compared with the results of the
message-passing version.


1.  Introduction 

High Performance Fortran (HPF) [7,8] is a data parallel, high level programming
language for parallel computing that is expected to be more convenient in terms of
portability and maintainability than explicit message passing and to allow higher
productivity in software development. But the porting of key commercial applications to
HPF is still of critical importance for the continuing development and acceptance of
HPF as a standard and for the improvement of HPF compilers.

The ESPRIT project "Open HPF Programming Environments" (PHAROS) aimed
to assess HPF as a paradigm for porting large FORTRAN 77 scientific applications to
shared and distributed memory architectures, in comparison to message-passing
programming. The PHAROS project was funded by the European Union's ESPRIT
program for research and development in information technology. It was a two-year
project, running from January 1996 until December 1997.

To this end, four major commercial FORTRAN 77 application codes have been
successfully ported to HPF (structural analysis, CFD and electromagnetism application
codes). These codes already had message-passing parallel versions. The comparison
of HPF to message-passing considered factors such as:

•  the  porting  effort; 

•  the  performance  of  the  resulting  code; 

•  the  portability  and  maintainability  of  the  resulting  code. 

One of the PHAROS applications was the AEROLOG computational fluid dynamics
software of MATRA BAe Dynamics [1,5]. The AEROLOG code is proprietary
CFD software devoted to the study of compressible fluid flows around complex
geometries. For many years, it has been systematically applied during the aeronautical
development programs MISTRAL, MICA, and APACHE, reducing the experimental
studies and consequently cutting down costs and delays.

Together with HPF experts and HPF tool providers, the version AEROLOG-v3.2e
(e for Euler inviscid fluid model) has been ported to HPF. It is an industry-relevant
subset of the latest release AEROLOG-v3.2 which includes all the functionalities used
in today's applications. The full release AEROLOG-v3.2 is composed of 102 subroutines
with about 19000 lines of source code written in standard FORTRAN 77. The
reduced code AEROLOG-v3.2e is composed of 55 subroutines with about 11500 lines
of source code. It is equivalent to the content of the message passing version of the
AEROLOG code.

In accordance with the workplan of the PHAROS project, the rest of this paper is
organized as follows. Section 2 describes the AEROLOG software and section 3
outlines the initial port to HPF. The code review in section 4 identified the problems of
the initial version and resulted in the code tuning presented in section 5. We discuss
our expectations for the next generation of HPF compilers in section 6 to overcome
the still existing problems. Finally, we compare in section 7 the results of the HPF
versions with the MPI version and conclude in section 8.


2.  Description  of  the  AEROLOG  Software  and  the  Test  Cases 


2.1 The AEROLOG CFD Code

The AEROLOG-v3.2 code allows the simulation of steady or unsteady inviscid and
compressible fluid flows over three-dimensional geometries by solving the Euler
system of partial differential equations. It utilizes an explicit time integration scheme of
Lax-Wendroff type. It is second order accurate in time and allows steady flow
simulation with the local time-stepping technique or unsteady simulation with a uniform time
step limited by the so-called CFL stability condition. Another functionality is the
finite volume space integration scheme. It is a three-dimensional extension of the cell
vertex Ni scheme [4] which is second order accurate in space on a Cartesian grid. The
formulation is fully conservative so that shock and expansion waves are automatically
captured.

The data layout is based on a multidomain meshing strategy. The global mesh is
composed of an assembly of locally structured three-dimensional mesh blocks (I, J, K
families of mesh lines). All types of degenerations are allowed on the mesh boundaries,
like mesh planes degenerating into a mesh line or a point. This is very useful for
the meshing of complex geometries, but it requires the implementation of the
appropriate matching conditions.

The  most  time  consuming  part  of  the  code  is  the  subroutine  that  computes  the  time 
increments  for  the  physical  variables  at  each  time  step.  It  is  composed  of  a  succession 
of  calls  to  subroutines  that  can  be  sorted  into  two  groups: 

•  The "local" routines are called independently over subdomains. This group
typically uses up to 90% of the total CPU time within the FORTRAN 77 code.

•  The "boundary" routines perform the boundary conditions: flow conditions
(inflow, outflow, walls, etc.) and numerical matching conditions (interfaces
between subdomains). These routines make intensive use of indirect addressing
and involve dependencies between data belonging to different subdomains.


2.2  CFD  Test  Cases 

Three meshes of increasing sizes (see Table 1) have been built around the same
geometry of a blunt body. The free flow conditions are: Mach number equal to 2.96 and
angle of attack equal to 10°. With these conditions, the fluid flow over the blunt body
shows strong shock and expansion waves, characteristic of MATRA industrial
applications.


Test Case    Sizes and Number      Total Mesh    Processing
Name         of Subdomains         Points        Nodes

Small        (65 x 33 x 9) x 8     154440        1-8
Medium       (... x 9) x 8         304200        2-16
Large        (... x 9) x 8         603720        4-64

Table 1. Industrial Test Cases.


Due to the coarse grain parallelization strategy over subdomains (see next section),
the number of HPF processors is limited by the number of subdomains. In order to run
a number of HPF tasks higher than the initial number of subdomains, a pre-processor
can be applied [6]. This pre-processor takes as input a mesh file with its associated
topological description and automatically generates a new mesh data set subject
to the following three constraints:

•  generate the given number of mesh blocks,

•  optimize the load balancing (i.e. homogeneous mesh block sizes),

•  minimize the size of block interfaces.


3.  Initial  Port  to  HPF 


3.1  Porting  to  Fortran  90 

Initially, the code was ported from FORTRAN 77 to Fortran 90. Apart from
inserting F90 syntax, e.g. array syntax and interface blocks, we replaced the old
one-dimensional "work-array" with dynamic arrays. Some of the arrays became
allocatable arrays; others, especially for local data, became automatic arrays. These
changes made the code more flexible as the static size of the workspace is no longer
given. But they were also absolutely necessary to allow the HPF distribution of the
mesh data in a useful way.

The porting to Fortran 90 was supported by the Foresys (FORtran Engineering
SYStem) tool from SIMULOG [11], which is a reverse-engineering, migration and
development support system for Fortran. It was especially useful to generate interface
blocks with INTENT attributes for the dummy arguments, and to take advantage of new
syntax and new language features.


3.2  Coarse  Grain  Parallelization  Strategy 

For  the  HPF  parallelization  we  have  chosen  the  following  strategy: 

•  The  loops  over  the  different  subdomains  calling  the  local  routines  provide  coarse 
grain  parallelism  without  communication.  The  HPF  mapping  directives  have  to 
guarantee  that  all  data  belonging  to  one  subdomain  is  completely  mapped  to  the 
same  processor. 

•  The matching conditions of the interfaces of adjacent subdomains are difficult to
handle. The initial strategy was the replication of the data and computations
involved in the boundary conditions.

In the AEROLOG code, the data of the two- or three-dimensional subdomains is
linearized and stored in a one-dimensional array. To use the coarse grain parallelism,
it is essential that we can distribute the data in the program in such a way that one
subdomain is completely owned by the processor that will work on this data. As the
subgrids have different sizes, HPF would need to support generalized
block distributions where the user can pass to the compiler the corresponding block
sizes for each processor. Unfortunately, none of the commercial HPF compilers
supported this feature during the project. For this reason, we had to reorganize
the one-dimensional mesh data arrays into two-dimensional ones. The second
dimension corresponds to the subdomain numbering and is distributed by BLOCK
(see Fig. 1). This imposed significant changes to the code, not only for the arrays
containing mesh data, but also for all integer arrays used for indirect addressing of
mesh data.


Fig.  1.  New  data  structures  for  the  mesh  data  and  their  distribution. 
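
As a rough illustration of the reorganized layout, the following Python sketch pads subdomains of different sizes into an NMAX x NSDTOT array and assigns whole columns to processors in contiguous blocks, in the spirit of a (*,BLOCK) distribution. The helper names and the sample sizes are assumptions made for illustration only; they are not taken from the AEROLOG code.

```python
import math


def pad_subdomains(subdomains, fill=0.0):
    """Store variable-length subdomains as equally long columns.

    Returns the padded columns, the common column length NMAX and the
    number of padding entries (memory wasted by the extra dimension).
    """
    nmax = max(len(s) for s in subdomains)
    columns = [list(s) + [fill] * (nmax - len(s)) for s in subdomains]
    wasted = sum(nmax - len(s) for s in subdomains)
    return columns, nmax, wasted


def block_owner(nsd, nsdtot, nproc):
    """Owner (0-based) of subdomain nsd (0-based) under a BLOCK distribution.

    Each processor receives ceil(nsdtot / nproc) consecutive subdomains.
    """
    block = math.ceil(nsdtot / nproc)
    return nsd // block


if __name__ == "__main__":
    # Four hypothetical subdomains with 6, 4, 5 and 6 mesh values.
    subs = [[1.0] * 6, [2.0] * 4, [3.0] * 5, [4.0] * 6]
    cols, nmax, wasted = pad_subdomains(subs)
    print("NMAX =", nmax, "wasted entries =", wasted)
    print("owners:", [block_owner(i, len(subs), 2) for i in range(len(subs))])
```

The padding count makes visible the memory cost of the extra dimension when subdomain sizes differ, which a generalized block distribution would avoid.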


3.3  Coarse  Grain  HPF  Implementation 

With the help of the INDEPENDENT directive, we enabled the parallelization of
the loops over the subdomains. The local routines are defined as PURE routines to
allow their parallel execution for the different subdomains (see Fig. 2). Furthermore,
the local routines do not have to be parallelized at all and do not need any HPF directive.

The AEROLOG code takes advantage of sequence association. The subdomains are
implicitly reshaped within the local subroutines. Within one subroutine, one
subdomain is always considered as a three-dimensional rectangular grid. Though the HPF
standard does not allow sequence association for mapped arguments, we could rely on
it as long as it is only used for a single subdomain that is completely mapped to one
processor.

For the boundary routines, the values of the boundary nodes of the different
subdomains are gathered from the distributed mesh data. For the initial HPF port, the data
and computations are replicated on all nodes and every processor updates the values of
its boundary mesh nodes. The gathering of the distributed data is realized by
replication of the whole mesh data. As the mesh data is not distributed within the boundary
routine, implicit remapping at subroutine boundaries is utilized (see Fig. 3).


integer :: NSDTOT                          ! total number of subdomains
integer :: NMAX                            ! maximal size of one subdomain
real, dimension(NMAX,NSDTOT) :: F          ! mesh data, e.g. force
!hpf$ distribute F(*,block)                ! distribute the subdomains
integer, dimension(NSDTOT) :: IM, JM, KM   ! sizes

!hpf$ independent
do NSD = 1, NSDTOT
   call LOCAL_ROUTINE (F(1,NSD), IM(NSD), JM(NSD), KM(NSD), ...)
end do

call BOUNDARY (F, NMAX, NSDTOT, ...)

pure subroutine LOCAL_ROUTINE (F, IM, JM, KM, ...)
   integer, intent(in) :: IM, JM, KM
   real, dimension(IM,JM,KM), intent(inout) :: F

end subroutine LOCAL_ROUTINE

Fig.  2.  Outline  of  the  initial  HPF  AEROLOG  Code. 

subroutine BOUNDARY (F, NMAX, NSDTOT, ISD_B, IJK_B, NB)
   integer, intent(in) :: NMAX, NSDTOT, NB
   integer, dimension(2,NB), intent(in) :: ISD_B, IJK_B
   real, dimension(NMAX,NSDTOT), intent(inout) :: F
!hpf$ distribute F(*,*)                    ! replicated mesh data
   real :: X1, X2, X
   integer :: IB, IJK1, ISD1, IJK2, ISD2
   do IB = 1, NB
      IJK2 = IJK_B(2,IB); ISD2 = ISD_B(2,IB)
      IJK1 = IJK_B(1,IB); ISD1 = ISD_B(1,IB)
      X = (F(IJK1,ISD1) + F(IJK2,ISD2)) * 0.5
      F(IJK1,ISD1) = X; F(IJK2,ISD2) = X
   end do
end subroutine

Fig.  3.  Computation  of  boundary  conditions  in  the  AEROLOG  code. 
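
For readers less familiar with Fortran, the matching loop of Fig. 3 can be mirrored by a small Python sketch. It uses 0-based indices and nested lists instead of the two-dimensional array F; the function name and the sample data are assumptions for illustration.

```python
def apply_matching(f, ijk_b, isd_b):
    """Average each pair of matching interface nodes, as in the BOUNDARY loop.

    f[isd][ijk] is the value of node ijk in subdomain isd (0-based here).
    ijk_b[ib] and isd_b[ib] hold the node and subdomain indices of the two
    ends of interface pair ib.
    """
    for (ijk1, ijk2), (isd1, isd2) in zip(ijk_b, isd_b):
        x = 0.5 * (f[isd1][ijk1] + f[isd2][ijk2])
        f[isd1][ijk1] = x
        f[isd2][ijk2] = x
    return f


if __name__ == "__main__":
    # Two hypothetical subdomains; the last node of the first one matches
    # the first node of the second one.
    f = [[1.0, 2.0, 3.0], [5.0, 6.0, 7.0]]
    apply_matching(f, ijk_b=[(2, 0)], isd_b=[(0, 1)])
    print(f)
```

Because each pair may couple nodes owned by different processors, the loop needs values from arbitrary subdomains, which is why the initial port replicates the whole array F.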


4. Code Review

We  tested  the  initial  HPF  port  with  the  following  compilers: 

•  NAS HPFPlus by NA Software Liverpool, Release 2.01, a commercial HPF compiler
that was also the target compiler for all HPF codes in the PHAROS project;

•  PG HPF by Portland Inc., Oregon [9], AIX Rel. 2.2-1, another commercial HPF
compiler;

•  ADAPTOR HPF compiler, version 5.1 (Oct. 1997) [3], developed at SCAI in
GMD, a research compiler that is available in the public domain.

All results have been measured on the IBM SP2 at the GMD. We give the execution
times in seconds for 5 iterations on the 'small' test case that works on 8 subdomains;
every subdomain contains 65 x 33 x 9 mesh points (see also Table 1).


Fig.  4.  Execution  times  (in  seconds)  of  initial  HPF  version. 

Fig. 4 shows the execution times of the initial HPF version, compiled by the native
Fortran 90 compiler (xlf) and by the different HPF compilers, running on 1, 2, 4, and 8
processors. The HPF version (considered as a Fortran 90 version without directives)
has nearly the same performance as the original FORTRAN 77 version. The execution
times, separated for the local and boundary routines, show that the local routines are
parallelized perfectly. They scale well and the HPF parallelization causes no overhead.
But none of the boundary routines scale. On the contrary, their execution time increases
with the number of processors. This is due to the replication of distributed data, which
involves an all-to-all communication. Furthermore, the results show that the compilers
already differ in their support for this kind of structured communication, which follows a
fixed communication pattern where every processor knows which data has to be sent
and received.


Fig. 5. Replication of distributed data for the NAS, PGI and ADAPTOR (ADP) compilers
on NP = 1, 2, 4, 8 and 16 processors: (a) replication of 16 kBytes, (b) replication of
128 kBytes.

In order to estimate the cost of replications, we benchmarked a sample code that
performs only data replications of arrays with varying sizes. Fig. 5 shows how much
time (in milliseconds) the replication of distributed data needs for the different
compilers and for the different numbers of processors. Array sizes of 16 kBytes and 128
kBytes are considered. The time for replicating distributed data increases with the size
of the array and with the number of processors. The ADAPTOR runtime system is
able to recognize at runtime that the replication of distributed data on a single
processor does not require any copying at all.


5.  Tuning  of  the  HPF  Code 


As  the  results  of  the  initial  HPF  port  show,  only  the  tuning  of  the  boundary  routines 
is  necessary.  We  considered  two  strategies: 

•  We left the mesh arrays distributed in the boundary routines and relied on the
capabilities of the HPF compiler to deal with unstructured communication.
Unfortunately, all HPF compilers failed to generate more efficient code than for the initial
HPF version. In particular, the NAS HPFPlus compiler provided no efficient
support at all for indirect addressing.

•  Instead of replicating the whole array containing the mesh data, we compressed the
full mesh data to the boundary data before replicating it. This solution needed new
data structures for packing and unpacking of boundary data (see Fig. 6). The
packing of the data can be done independently for all subdomains. This approach
required only HPF features that were supported by all HPF compilers.


Fig.  6.  Packing  of  boundary  data. 
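
The packing strategy can be sketched in Python as follows. This is a minimal illustration under assumed data structures (a list of boundary node indices per subdomain and a list of matching pairs); it is not the actual AEROLOG implementation, and the list `packed` merely stands in for the replicated boundary array.

```python
def pack_boundary(f_local, boundary_idx):
    """Copy only the boundary nodes of one subdomain into a small buffer."""
    return [f_local[i] for i in boundary_idx]


def unpack_boundary(f_local, boundary_idx, buf):
    """Write updated boundary values back into the full subdomain data."""
    for i, value in zip(boundary_idx, buf):
        f_local[i] = value


def exchange(f, boundary_idx, pairs):
    """Pack per subdomain, average matching pairs on packed data, unpack.

    pairs holds ((sub1, pos1), (sub2, pos2)) entries, where pos is the
    position of the node inside the packed buffer of its subdomain.
    Returns the number of values that would have to be replicated.
    """
    # Packing is independent for every subdomain.
    packed = [pack_boundary(f[s], boundary_idx[s]) for s in range(len(f))]
    # Only 'packed' would be communicated, not the whole mesh data.
    for (s1, k1), (s2, k2) in pairs:
        x = 0.5 * (packed[s1][k1] + packed[s2][k2])
        packed[s1][k1] = x
        packed[s2][k2] = x
    for s in range(len(f)):
        unpack_boundary(f[s], boundary_idx[s], packed[s])
    return sum(len(p) for p in packed)


if __name__ == "__main__":
    f = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
    n = exchange(f, boundary_idx=[[3], [0]], pairs=[((0, 0), (1, 0))])
    print(f, "replicated values:", n)
```

In this toy case only 2 of the 8 mesh values are replicated; the saving grows with the ratio of interior to boundary nodes.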

By packing the boundary data, much less data has to be replicated between
the different processors. Only the boundary data and not the whole mesh data is
exchanged between the processors. The results shown in Fig. 7 verify the effectiveness
of the chosen approach. Compared to the sequential version, speedups from 4 to 5 on 8
processors are achieved.


6.  Expectations  for  the  Next  Generation  of  HPF  Compilers 

The current tuned HPF version is still not fully portable between the different HPF
compilers, as the calling of local subroutines within an independent loop is supported
differently. The NAS HPFPlus compiler did not act at all upon the INDEPENDENT
directive, but scheduled the local computations, defined as HPF_SERIAL routines
and not as PURE routines, on the processors owning the subdomain. The need for
slightly different versions of the HPF code should disappear with the next
releases of the HPF compilers.

Considering HPF 2.0 [8], we expect support for general block distributions. This
would avoid the additional dimension for the subdomains, and no memory would be
wasted in case of different subdomain sizes (see also Fig. 1). This feature
is absolutely necessary to combine the evolution of the serial and the HPF versions of
the AEROLOG code.


Fig. 7. Execution times (in seconds) of tuned HPF version.


The Amdahl limit restricts the maximum speed-up as long as the boundary
computations are not parallelized. But then unstructured communication has to be supported.
Therefore the compiler has to build a schedule required for accessing remote items of
distributed arrays and for communication optimization. Unfortunately, schedules
cannot be worked out at compile time, but only at run-time, when the values of the
indirection arrays are known. The code design which first builds the schedule, then uses it
to carry out the actual communication and computation, has been coined the
inspector/executor scheme [10]. The PGI and ADAPTOR compilers followed this
design, but only the latest release of the ADAPTOR tool provided sufficient support for
reusing communication schedules by an additional directive [2].
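
To make the Amdahl limit mentioned above concrete, the following Python sketch evaluates the classical formula. The parallel fraction of 0.9 is an assumption taken from the statement in Section 2.1 that the local routines use up to 90% of the CPU time; the cost of replication and communication is ignored, so the numbers are only an optimistic bound.

```python
def amdahl_speedup(parallel_fraction, nproc):
    """Upper bound on speedup when only parallel_fraction of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / nproc)


if __name__ == "__main__":
    # Assumed: 90% of the run time (local routines) is parallelized,
    # 10% (boundary routines) is replicated on every processor.
    for p in (1, 2, 4, 8, 16):
        print(p, round(amdahl_speedup(0.9, p), 2))
```

Under this assumption the bound on 8 processors is about 4.7, and it can never exceed 10 regardless of the number of processors, which is why parallelizing the boundary computations matters for scalability.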

Fig. 8 presents results for the 'medium' test case (100 iterations), for the local
routines (local) and for different implementations of the boundary routines. The
replication strategy of the initial and tuned HPF version does not scale. The unstructured
communication (unstr.) for the parallelized boundary computations scales, but
produces an unacceptable overhead due to the high costs for building the communication
schedule. If the communication schedule can be reused, e.g. by tracing modifications
of the indirection array as described in [2], the unstructured communication produces
good results (traced).
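
The inspector/executor idea and the benefit of schedule reuse can be sketched in Python as follows. All names are hypothetical, and the cache below simply compares the contents of the indirection array, which is a simplification: the tracing approach of [2] detects modifications instead of re-comparing the array.

```python
def inspector(indices, owner_of, me):
    """Build a schedule: for each remote owner, the indices 'me' must fetch."""
    schedule = {}
    for i in indices:
        p = owner_of(i)
        if p != me:
            schedule.setdefault(p, []).append(i)
    return schedule


def executor(schedule, remote_read):
    """Carry out the communication described by a previously built schedule."""
    return {i: remote_read(p, i) for p, idx in schedule.items() for i in idx}


class ScheduleCache:
    """Rebuild the schedule only when the indirection array has changed."""

    def __init__(self):
        self._key = None
        self._schedule = None
        self.builds = 0  # number of (expensive) inspector runs

    def get(self, indices, owner_of, me):
        key = tuple(indices)
        if key != self._key:
            self._schedule = inspector(indices, owner_of, me)
            self._key = key
            self.builds += 1
        return self._schedule


if __name__ == "__main__":
    data = list(range(100, 112))          # 12 values, 4 per processor
    owner = lambda i: i // 4              # assumed block ownership
    read = lambda p, i: data[i]           # stand-in for a remote fetch
    cache = ScheduleCache()
    for _ in range(3):                    # three iterations, same indices
        sched = cache.get([0, 5, 9, 6], owner, me=0)
        values = executor(sched, read)
    print(sched, values, "inspector runs:", cache.builds)
```

Over many iterations with an unchanged indirection array the inspector runs once, so its cost is amortized, which corresponds to the difference between the "unstr." and "traced" variants discussed above.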


Fig. 8. Different tuning strategies of the boundary computations (ADAPTOR): local,
initial, tuned, unstr. and traced, for P = 2, 4, 8 and 16 processors.


7.  Benchmarking  and  Comparison  with  Message  Passing 

The porting of the AEROLOG code to HPF required significant code changes, as
did the message passing port, but it could be done step by step, always having a
running version. This porting included useful code cleaning and modernized memory
management. Replacing the super-array technique with Fortran 90
dynamic allocation of the local arrays brings simplification and flexibility to the code.
But many code changes were only required due to the limited capabilities of the HPF
compilers.

The HPF and the message passing versions achieve nearly the same performance for
smaller numbers of processors. But the message passing version of the AEROLOG
code scales better (see Fig. 9). It also parallelizes the boundary routines and takes
advantage of explicitly reusing communication schedules. But with an HPF compiler
that supports unstructured communication and reuses schedules, the scalability of the
MPI version can nearly be achieved, as the results with the ADAPTOR compilation
system verify.

From  the  code  development  and  maintenance  point  of  view,  it  is  possible  to  replace 
smoothly  the  FORTRAN  77  reference  code  in  favor  of  the  Fortran  90/HPF  code.  The 
benefits  of  this  migration  will  be  the  merging  of  the  sequential/parallel 
shared/distributed  memory  versions  of  AEROLOG,  which  have  reached  very  different 
levels  of  development  at  the  moment.  The  migration  of  the  complete  AEROLOG  code 
(implicit  solver,  Navier-Stokes  solver,  etc.)  is  eased  by  the  choice  of  the  coarse  grain 
parallelization strategy based on the multidomain approach. In particular, it is not
necessary to rewrite the local algorithms, which can remain FORTRAN 77, saving
considerable porting effort and reducing the risk of bugs. Experimental results with the ADAPTOR compiler
have  also  shown  that  the  HPF  version  is  well  suited  for  shared  memory  architectures 
by  translating  the  HPF  directives  into  parallelization  directives  for  the  native  compiler. 
Due  to  the  shared  memory,  runtime  support  for  unstructured  communication  is  not 
necessary. 

In any case, we have seen from the PHAROS final benchmarks that it is not
possible at the moment to get rid of the message-passing version of the code, which is the
only one able to run efficiently enough on massively parallel computers.
Enhancements of the HPF compiler technology are still required for a complete migration to
HPF.


Fig. 9. Speedups on industrial test cases for NAS (tuned), PGI (tuned), ADP (tuned),
ADP (traced) and MPI on NP = 1, 2, 4, 8, 16 and 32 processors.


8. Conclusions

With the end of the PHAROS project, we have an HPF version of the AEROLOG
code that runs with at least three HPF compilers and produces acceptable results for a
limited number of processors. The porting effort was higher than expected because
code restructuring was required in order to achieve the HPF implementation of the
coarse grain parallel strategy. As HPF concepts are rather complicated for
non-specialists, the know-how transfer from tool providers and experts to the end-user was
very important and might be considered as a major benefit of the PHAROS project.

At this time, the code is not fully portable as different language features are used
for the two commercial HPF compilers. This is not only due to the missing support in
the compilers, but also due to the fact that the HPF standard was not rigid enough, so
that HPF directives led to various interpretations. With future releases of the HPF
compilers, these problems will disappear. The tuning of the boundary conditions
required a lot of effort. This effort might be less with advanced HPF compilers where
unstructured communication is better supported.

The HPF version can directly be compiled for a serial machine, achieving the same
performance as the original code. While the independent computations over the
subdomains scale well, the boundary conditions remain the critical part, even in the
tuned version. Replication of mesh data is rather expensive, and unstructured
communication is not well supported. Due to the replication of the boundary computations, the
scalability of this version is limited in any case.

Nevertheless, experimental results with the research compilation system ADAPTOR
verify that a scalable and efficient HPF parallelization of the AEROLOG software
is possible if general block distributions and unstructured communication are
sufficiently supported.

In conclusion of this project, we can state that HPF is a useful paradigm for porting
large FORTRAN 77 applications to parallel architectures and in the long run the better
alternative. But an efficient and portable parallelization and a higher productivity in
software development can only be achieved if HPF compilers improve substantially.


Abstract. The actual behaviour of parallel programs is of capital
importance for the development of an application. Programs will
be considered mature applications when their performance is
within acceptable limits. Traditional parallel programming
forces the programmer to understand the enormous amount of
performance information obtained from the execution of a
program. In this paper, we propose an automatic analysis tool
that lets the programmers of applications avoid this difficult
task. The main objective of this automatic performance analysis
tool is to find poorly designed structures in the application. It
considers the trace file obtained from the execution of the
application in order to locate the most important behaviour
problems of the application. Then, the tool relates them to the
corresponding application code and scans the code looking for any
design decision which could be changed to improve the behaviour.


1. Introduction

The  performance  of  a  parallel  program  is  one  of  the  main  reasons  for  designing 
and  building  a  parallel  program  [1].  When  facing  the  problem  of  analysing  the 
performance  of  a  parallel  program,  programmers,  designers  or  occasional  parallel 
systems  users  must  acquire  the  necessary  knowledge  to  become  performance  analysis 
experts. 

Traditional parallel program performance analysis has been based on the
visualization of several graphical views of the execution [2, 3, 4, 5]. These
high-level graphical views represent an abstract description of the execution
data obtained from many possible sources and even different executions of the
same program [6].


* This work has been supported by the CICYT under contract TIC 95-0868.

The amount of data to be visualized and analyzed, together with the huge number
of sources of information (parallel processors and interconnecting network
states, messages between processes, etc.), makes the task of becoming a
performance expert difficult. Programmers need a high level of experience to be
able to derive any conclusions about the program behaviour using these
visualisation tools. Moreover, they also need to have a deep knowledge of the
parallel system because the analysis of many performance features must consider
architectural aspects like the topology of the system and the interconnection
network.

In this paper we describe a Knowledge-based Automatic Parallel Program
Analyser for Performance Improvement (KAPPA-PI tool) that eases the performance
analysis of a parallel program. Analysis experts look for special configurations
of the graphical representations of the execution which point to problems in
the execution of the application. Our purpose is to substitute the expert with
an automatic analysis tool which, based on a certain knowledge of what the most
important performance problems of parallel applications are, detects the
critical execution problems of the application and shows them to the application
programmer, together with source code references for the problem found and
indications on how to overcome it.


We  can  find  other  automatic  performance  analysis  tools: 

- Paradyn [7] focuses on minimising the monitoring overhead. The Paradyn tool
performs the analysis “on the fly”, without having to generate a trace file
to analyse the behaviour of the application. It also has a list of hypotheses
of execution problems that drive the dynamic monitoring.

- The AIMS tool [8] takes a similar approach to the problem of performance
analysis. The tool builds a hierarchical account of program execution time
spent on different operations, analyzing in detail the communications
performed between the processes.

- Another approach to the problem of analysing parallel program performance
is carried out by [9] and [10]. The solution proposed is to build an abstract
representation of the program with the help of an assumed programming model of
the parallel system. This abstract representation of the program is analysed
to predict some future aspects of the program behaviour. The main problem of
this approach is that, if the program is modelled from a high-level view, some
important aspects of its performance may not be considered, as they will be
hidden under the abstract representation.

- The performance of a program can also be measured by a pre-compiler, as in
Fortran approaches (P3T [11]). This approach is not applicable to all parallel
programs, especially those where the programmer expresses dynamic unstructured
behaviour.

Our KAPPA-PI tool is currently implemented (in the Perl language [12]) to
analyse applications programmed under the PVM [13] programming model. The
KAPPA-PI tool bases the search for performance problems on its knowledge of
their causes. The analysis tool performs a “pattern matching” between those
execution intervals which degrade performance and the “knowledge base” of
causes of the problems. This is a process of identification of problems and
creation of recommendations for their solution. This working model allows the
“performance problem data base” to adapt to new possibilities of analysis with
the incorporation of new problems (new knowledge data) derived from
experimentation with programs and new types of programming models.

In section 2, we briefly describe the analysis methodology, explaining the
basis of its operation and the processing steps to detect a performance
problem. Section 3 presents the actual analysis of a performance problem
detected in an example application. Finally, section 4 presents the conclusions
and future work on the tool development.


2. Automatic analysis overview

The  objective  of  the  automatic  performance  analysis  of  parallel  programs  is  to 
provide  information  regarding  the  behaviour  of  the  user’s  application  code. 

This information may be obtained by statically analysing the code of the
parallel program. However, due to the dynamic behaviour of the processes that
form the program and the features of the parallel system, this static analysis
may not be sufficient.

Execution information is therefore needed to draw any effective conclusion
about the behaviour of the program. This execution information can be collected
in a trace file that includes all the events related to the execution of the
parallel program. However, the information included in the trace file is not
meaningful to the user, who is only concerned with the code of the application.

The automatic performance analysis tool concentrates on analysing the behaviour
of the parallel application expressed in the trace file in order to detect the
most important performance problems. Nonetheless, the analysis process cannot
stop there and must relate the problems found to the actual code of the
application. In this way, the user receives meaningful information about the
application behaviour.

In figure 1, we represent the basic analysis cycle followed by the tool to
analyse the behaviour of a parallel application.

Fig. 1. Schema of the analysis of a parallel application

The analysis first considers the study of the trace file in order to locate the
most important performance problems occurring during the execution. Once those
problematic execution intervals have been found, they are studied individually
to determine the type of performance problem for each execution interval.


When  the  problem  is  classified  under  a  specific  category,  the  analysis  tool  scans 
the  segment  of  application  source  code  related  to  the  execution  data  previously 
studied. This analysis of the code brings out any design problem that may have
produced  the  performance  problem.  Finally,  the  analysis  tool  produces  an  explanation 
of  the  problems  found  at  this  application  design  level  and  recommends  what  should 
be  changed  in  the  application  code  to  improve  its  execution  behaviour. 

In the following points, the operations performed by the analysis tool are
explained in detail.


2.1. Problem Detection

The  first  part  of  the  analysis  is  the  study  of  the  trace  file  obtained  from  the 
execution  of  the  application.  In  this  phase,  the  analysis  tool  scans  the  trace  file, 
obtained  with  the  use  of  TapePVM  [14],  with  the  purpose  of  following  the  evolution 
of  the  efficiency  of  the  application.  The  application  efficiency  is  basically  found  by 
measuring  the  number  of  processors  that  are  executing  the  application  during  a 
certain  time. 

The analysis tool collects those execution time intervals in which the
efficiency is lowest. These intervals represent situations where the
application is not using all the capabilities of the parallel machine. They
could be evidence of an application design fault. In order to analyse these
intervals further, the analysis tool selects the most important inefficiencies
found in the trace file. More importance is given to those inefficiency
intervals that affect the largest number of processors for the longest time.
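
The detection step described above can be sketched as follows. This is only a
minimal illustration in Python (the tool itself is written in Perl), assuming
the trace has already been reduced to busy intervals per processor. The names
`inefficiency_intervals`, `rank_by_importance` and the `threshold` parameter
are hypothetical, and weighting an interval by idle processors times duration
is just one plausible reading of the ranking criterion.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]  # (start, end) in trace time units


def inefficiency_intervals(
    busy_intervals: Dict[str, List[Interval]],
    threshold: float = 1.0,
) -> List[Tuple[float, float, int]]:
    """Return (start, end, idle_count) spans where efficiency < threshold.

    Efficiency in a span is the fraction of processors that are busy.
    """
    # Every interval boundary is a point where the busy count may change.
    points = sorted({t for spans in busy_intervals.values()
                     for span in spans for t in span})
    total = len(busy_intervals)
    result: List[Tuple[float, float, int]] = []
    for start, end in zip(points, points[1:]):
        busy = sum(
            any(s <= start and end <= e for s, e in spans)
            for spans in busy_intervals.values()
        )
        if busy / total < threshold:
            idle = total - busy
            # Merge with the previous span when adjacent and equally idle.
            if result and result[-1][1] == start and result[-1][2] == idle:
                result[-1] = (result[-1][0], end, idle)
            else:
                result.append((start, end, idle))
    return result


def rank_by_importance(spans):
    """Sort spans so that many idle processors for a long time come first."""
    return sorted(spans, key=lambda s: (s[1] - s[0]) * s[2], reverse=True)
```

With three processors where P2 is idle in (4, 8) and P3 is idle in (2, 8), the
sketch reports two spans and ranks the one with two idle processors first.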


2.2.  Problem  Determination 

Once the most important inefficiencies are found, the analysis tool proceeds to
classify each performance problem with the help of a “knowledge base” of
performance problems. This classification is implemented in the form of a
problem tree, as seen in figure 2.

[Figure 2: problem tree. Root: inefficiency patterns. Branches: lack of ready
tasks; mapping problems. Causes (leaves): blocked sender, slow communication,
multiple output, master/slave, lack of parallelism, barrier problems.]

Fig. 2. Classification of the performance problems of an application

Each inefficiency interval in the trace is studied exhaustively in order to
find which branches of the tree describe the problem most accurately. When the
classification of the problem arrives at the lowest level of the tree, the tool
can proceed to the next stage, the source code analysis.
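
A descent through such a problem tree might look like the sketch below. It is
an assumption-laden illustration, not the tool's actual knowledge base: the
`ProblemNode` class, the `classify` function, the fact names
(`idle_processors`, `ready_processes`, `sender_state`, `waiting_on`) and the
predicates are all hypothetical; only the category names come from the text.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Features = Dict[str, object]  # facts extracted from one inefficiency interval


@dataclass
class ProblemNode:
    name: str
    matches: Callable[[Features], bool]
    children: List["ProblemNode"] = field(default_factory=list)


def classify(node: ProblemNode, facts: Features) -> Optional[List[str]]:
    """Return the path of matching node names from `node` down the tree."""
    if not node.matches(facts):
        return None
    for child in node.children:
        path = classify(child, facts)
        if path is not None:
            return [node.name] + path
    # No child matched: a leaf, or the deepest category that can be shown.
    return [node.name]


# Hypothetical fragment of the tree, using category names from the paper.
TREE = ProblemNode("inefficiency", lambda f: f["idle_processors"] > 0, [
    ProblemNode("mapping problem", lambda f: f["ready_processes"] > 0),
    ProblemNode("lack of ready tasks", lambda f: f["ready_processes"] == 0, [
        ProblemNode("blocked sender",
                    lambda f: f.get("sender_state") == "waiting_receive"),
        ProblemNode("barrier problem",
                    lambda f: f.get("waiting_on") == "barrier"),
    ]),
])
```

Because each branch is only explored when its parent matches, hypotheses under
a rejected branch are never tested.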




2.3. Application of the source code analysis

At  this  stage  of  the  program  evaluation,  the  analysis  tool  has  found  a  performance 
problem  in  the  execution  trace  file  and  has  classified  it  under  one  category. 

The aim of the analysis tool at this point is to point out any relationship
between the application structure and the performance problem found. This
detailed analysis differs from one performance problem to another, but basically
consists of the application of several pattern recognition techniques to the
code of the application.

First of all, the analysis tool must select those portions of the source code of
the application that generated the performance problem when executed. In order
to establish a relationship between the executed processes and the program
code, the analysis tool builds up a table of process identifiers and the names
of their corresponding code modules.

With the help of the trace file, the tool is able to relate the execution events
of certain operations, like sending or receiving a message, to a certain line
number in the program code. Therefore, the analysis tool is able to find which
instructions in the source code generated a certain behaviour at execution
time. Each pattern-matching technique tries to test a certain condition of the
source code related to the problem found. For each of the matches obtained in
this phase, the analysis tool will generate some explanations of the problem
found, the bounds of the problem and what possible alternatives there are to
alleviate the problem.
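
The mapping from trace events back to source lines can be illustrated with a
short sketch. It assumes that each trace event carries a process identifier
and a line number, as the text states; the `TraceEvent` record, the
`locate_source` function and the file name used in the usage note are
hypothetical and do not reflect the TapePVM trace format.

```python
from typing import Dict, List, NamedTuple


class TraceEvent(NamedTuple):
    pid: int    # process identifier recorded in the trace
    kind: str   # e.g. "send" or "recv"
    line: int   # 1-based source line that issued the operation


def locate_source(
    events: List[TraceEvent],
    modules: Dict[int, str],
    sources: Dict[str, List[str]],
) -> List[str]:
    """Map each event to 'module:line: text' using the pid -> module table."""
    located = []
    for ev in events:
        module = modules[ev.pid]
        text = sources[module][ev.line - 1].strip()
        located.append(f"{module}:{ev.line}: {text}")
    return located
```

For example, with `modules = {7: "min1.c"}` an event `TraceEvent(7, "recv", 1)`
is resolved to the first line of that (hypothetical) file.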

The list of performance problems, as well as their implications for the source
code of the application, is shown in table 1. A more exhaustive description of
the classification can be found in [15].

MAPPING PROBLEMS

  Name: Mapping problem
    Description: There are idle processors and ready-to-execute processes in
      busy processors.
    Trace information: Process assignments to busy processors, number of ready
      processors.
    Source code implications: Solutions affect the process-processor mapping.

COMMUNICATION RELATED

  Name: Blocked Sender
    Description: A blocked process is waiting for a message from another
      process that is already blocked for reception.
    Trace information: Waiting receive times of the blocked processes. Process
      identifiers of the sender partner of each receive.
    Source code implications: Study of the dependencies between the processes
      to eliminate waiting.

  Name: Multiple Output
    Description: Serialization of the output messages of a process.
    Trace information: Identification of the sender process and the messages
      sent by this process.
    Source code implications: Study of the dependencies between the messages
      sent to all receiving processes.

  Name: Long Communication
    Description: Long communications block the execution of parts of the
      program.
    Trace information: Time spent waiting. Operations performed by the sender
      at that time.
    Source code implications: Study of the size of data transmitted and delays
      of the interconnection network.

PROGRAM STRUCTURE RELATED

  Name: Master/Slave problems
    Description: The number of masters and collaborating slaves is not optimum.
    Trace information: Synchronization times of the slaves and master
      processes.
    Source code implications: Modifications of the number of slaves/masters.

  Name: Barrier problems
    Description: Barrier primitive blocks the execution for too much time.
    Trace information: Identification of barrier processes and time spent
      waiting for barrier end.
    Source code implications: Study of the latest processes to arrive at the
      barrier.

  Name: Lack of parallelism
    Description: Application design does not produce enough processes to fill
      all processors.
    Trace information: Analysis of the dependences of the next processes to
      execute.
    Source code implications: Possibilities of increasing parallelism by
      dividing processes.

Table 1. Performance problems detected by the analysis tool.

In the next section, we illustrate the process of analysing a parallel
application with the use of an example.


3. Example: analysis of an application

In this example we analyse a tree-like application with a significant amount of
communication between processes. The application is executed mapping each
process to a different processor. From the execution of the application we
obtain a trace file, which is shown as a space-time diagram, together with the
application structure, in figure 3.


Fig.  3.  Application  trace  file  space-time  diagram 

In  the  next  points  we  follow  the  operations  carried  out  by  the  tool  when  analysing 
the  behaviour  of  the  parallel  application. 


3.1.  Problem  Detection 


First of all, the trace is scanned to look for low efficiency intervals. The
analysis tool finds an interval of low efficiency when processors P2 and P3 are
idle due to the blocking of the processes “Min1” and “Max0”. Then, the
execution interval (t1,t2) is considered for further study.


3.2.  Problem  Determination 


The analysis tool tries to classify the problem found under one of the
categories. To do so, it studies the number of ready-to-execute processes in the
interval. As there are no such processes, it classifies the problem as “lack of
ready processes”. The analysis tool also finds that the processors are not just
idle, but waiting for a message to arrive, so the problem is classified as
communication related.

Then, the analysis tool must find out which communication problem applies. It
starts by analysing the last process (Max0), which is waiting for a message
from the Min1 process. When the tool studies what the Min1 process was doing at
that time, it finds that Min1 was already waiting for a message from Max2, so
the analysis tool classifies this problem as a blocked sender problem,
establishing the process sequence: Max2 sends a message to Min1 and Min1 sends
a message to Max0.
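
The way the waiting chain is followed can be sketched as below. This is a
minimal illustration, assuming the trace has already yielded a mapping from
each blocked process to the process it is waiting for; the function name
`blocked_chain` and the `waiting_on` mapping are hypothetical.

```python
from typing import Dict, List


def blocked_chain(waiting_on: Dict[str, str], start: str) -> List[str]:
    """Follow 'X waits for a message from Y' links starting at `start`.

    Returns the processes in the order the messages must flow, i.e. the
    root sender first and `start` last. Stops if a cycle is detected.
    """
    chain = [start]
    seen = {start}
    current = start
    while current in waiting_on:
        current = waiting_on[current]
        if current in seen:  # cyclic wait: not a simple blocked sender
            break
        chain.append(current)
        seen.add(current)
    return list(reversed(chain))
```

For the example above, `blocked_chain({"Max0": "Min1", "Min1": "Max2"}, "Max0")`
yields the sequence Max2, Min1, Max0; a chain of three or more processes
corresponds to the blocked sender situation.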


3.3. Analysis of the source code

In this phase of the analysis, the analysis tool analyses the data dependencies
between the messages sent by processes Max2, Min1 and Max0 (see figure 3).

First of all, the analysis tool builds up a table of the process identifiers
and the name of the C source program of each process.

When the program names are known, the analysis tool opens the source code file
of process Min1 and scans it looking for the send and receive operations
performed. From there, it collects the names of the variables which are
actually used to send and receive the messages. This part of the code is shown
in figure 4.


 1  pvm_recv(-1,-1);
 2
 3  pvm_upkfl(&calc,1,1);
 4
 5  calcl = min(calc,1);
 6
 7  for(i=0;i<sons;i++)
 8  {
 9      pvm_initsend(PvmDataDefault);
10
11      pvm_pkfl(&calcl,1,1);
12
13      pvm_send(tid_son[i],1);
14  }


Fig. 4. Relevant portion of the source code of process Min1


When the variables are found (“calc” and “calcl” in the example), the analysis
tool starts searching the source code of process “Min1” to find all possible
relationships between both variables. As these variables define the
communication dependence of the processes, the results of these tests will
describe the designed relationship between the processes.
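
A dependency test of this kind could be sketched as follows. This is a
deliberately simple illustration, not the tool's Perl implementation: it only
recognises straight-line `name = expression;` statements, ignores control
flow and pointers, and the names `depends_on`, `ASSIGN` and `IDENT` are
hypothetical.

```python
import re
from typing import List, Set

# 'name = expression;' but not 'name == ...'
ASSIGN = re.compile(r"^\s*([A-Za-z_]\w*)\s*=(?!=)\s*(.+?);")
IDENT = re.compile(r"[A-Za-z_]\w*")


def depends_on(lines: List[str], received: str, sent: str) -> bool:
    """True if `sent` is (transitively) computed from `received`."""
    tainted: Set[str] = {received}
    for line in lines:
        m = ASSIGN.match(line)
        if not m:
            continue
        target, expr = m.group(1), m.group(2)
        if tainted & set(IDENT.findall(expr)):
            tainted.add(target)
        else:
            tainted.discard(target)  # overwritten with an unrelated value
    return sent in tainted
```

Applied to the lines of the source excerpt of process Min1, the assignment
`calcl = min(calc,1);` is what makes `depends_on(lines, "calc", "calcl")` true.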

In this example, the dependency test is found true due to the instruction found
at line 5, which relates “calcl” with the value of “calc”. This dependency
means that the message sent to process “Max0” depends on the message received
from process “Max2”.

The recommendation produced for the user explains the dependency found. The
analysis tool suggests modifying the design of the parallel application in
order to distribute part of the code of process “Min1” (the instructions that
modify the variable to send) to process “Max0”, and then send the same message
to “Min1” and to “Max0”. The message shown to the user is given in figure 5.


Analysing MaxMin....

A Blocked Sender situation has been found in the execution.

Processes involved are:

Max0, Min1, Max2

Recommendation: A dependency between Max2 and Max0 has been found.

The design of the application should be revised.

Line 25 of Min1 process should be distributed to Max0.

Fig. 5. Output of the analysis tool


The line referred to in the recommendations of the tool (line 5 of the Min1
process) should be executed in the process Max0, so variable “calc” must be
sent to Max0 to solve the expression. Then, the code of the processes may be
changed as shown in figure 6.


Process Max0:

    pvm_recv(-1,-1);
    pvm_upkfl(&calc,1,1);
    calcl = min(calc,1);

Process Min1:

    pvm_recv(-1,-1);
    pvm_upkfl(&calc,1,1);
    calcl = min(calc,1);

Process Max2:

    calc = min(old,myvalue);
    pvm_initsend(PvmDataDefault);
    pvm_pkfl(&calc,1,1);
    pvm_send(tid_Min1,1);
    pvm_send(tid_Max0,1);


Fig. 6. New process code


In the new process code, the dependencies between the Min1 and Max2 processes
have been eliminated. From the execution of these processes we obtain a new
trace file, shown in figure 7. In the figure, the process Max0 does not have to
wait so long until the message arrives. As a consequence, the execution time of
this part of the application has been reduced.


Fig. 7. Space-time diagram of the new execution of the application


4.  Conclusions 

This automatic analysis tool is designed for programmers of parallel
applications who want to improve the behaviour of their applications. The
application programmer’s view of the tool is quite simple: the application is
brought to the analysis tool as input and, after the analysis, the programmer
receives a list of suggestions to improve the performance of the program. Those
suggestions explain, at programmer level, which problems have been found in the
execution of the application and how to solve them by changing the program
code.

Nonetheless, when applying the suggested changes to the application code, other
new performance problems could appear. Programmers must be aware of the
behavioural side-effects of introducing changes in the applications. Hence,
once the application code is rebuilt, a new analysis should be considered. This
new analysis requires a set of representative input data, so that the
execution of the application can be analysed comprehensively from a trace file.

Moreover, some problems may be produced by more than one cause. Sometimes it is
difficult to separate the different causes of the problems and propose the most
appropriate solution. This process of progressive analysis of problems with
multiple causes is one of the future fields of tool development.

Future work on the tool will consider extending and refining the causes of
performance problems, the “knowledge base”. The programming model of the
analysed applications must also be extended from the one currently used (PVM)
to other parallel programming paradigms.

Due to the general use of a few parallel execution trace formats [16, 4] and
programming libraries, it is possible to obtain similar kinds of performance
data from many different applications running on different parallel systems.
However, we have found that additional trace information (which is not easily
obtained) can ease the analysis task to a high degree.

But far greater efforts must be focused on the optimisation of the search phases
of the program. The search for problems in the trace file and the analysis of
causes for a certain problem must be optimised to operate on very large trace
files. The computational cost of analysing the trace file to derive these
results is not negligible, although the tool is built so as not to generate
much more overhead than the visual processing of a trace file.

The tree structure of the problems helps to eliminate the testing of some
hypotheses, but may complicate the analysis when considering problems with
multiple causes (at different levels of the tree).

Abstract. In this work we have studied the influence of the vector register
size on two different concepts of vector architectures. We have observed that
long vector registers play an important role in a conventional vector
architecture. However, we observed that, even using highly vectorizable codes,
only a small fraction of those large vector registers is used. Nevertheless, we
have observed that reducing the vector register size on a conventional vector
architecture results in a severe performance degradation, providing slowdowns
in the range of 1.8 to 3.8. When we include out-of-order execution in a vector
architecture, the need for long vector registers is reduced. We have used a
trace-driven approach to simulate a selection of the Perfect Club and Specfp92
programs. The results of the simulations show that the register size reduction
on an out-of-order vector architecture is less negative than on a conventional
vector machine, providing slowdowns in the range of 1.04 to 1.9. Even when
reducing the register size to 1/4 of the original size on an out-of-order
machine, the slowdown is in the range of 1.04 to 1.5, but it is still better
than a conventional vector machine. Finally, when comparing both architectures
using the same register file size (8kb), we can see that the performance
gained by using out-of-order execution is in the range of 1.13 to 1.40.

1  Introduction 

Numerical applications have been the area where vector architectures have proved
their efficiency. These vector architectures have used in-order execution,
limited forms of ILP techniques and memory systems with large latencies. In
order to achieve good performance and to be able to tolerate the large
latencies, this kind of processor has exploited the data-level parallelism
embedded in each vector instruction and has allowed the overlapping of vector
and scalar instructions when possible. Conventional vector architectures have
used large vector registers as one of the principal resources to hide latency.
When a vector instruction is started, it pays for some initial (potentially
long) latency, but then it works on a long stream of elements and effectively
amortizes this latency across all elements.

* On leave from the Centro de Investigacion en Computo, Instituto Politecnico
Nacional - Mexico D.F. This work was supported by the Instituto de Cooperacion
Iberoamericana (ICI), Consejo Nacional de Ciencia y Tecnologia (CONACYT).

This work was supported by the Ministry of Education of Spain under contract
0429/95, and by the CEPBA.

From this point of view, we can understand why vector machines have been
designed with vector registers as large as possible. Unfortunately, large
registers have several disadvantages:

• When the application cannot make full use of the vector register size, a
precious hardware resource is being wasted [1, 2].

• Large registers mean a large number of transistors and a high cost; this
implies that only a few of them can be implemented in the design.

• If the number of registers that the compiler sees is small, then the amount
of spill code introduced to support all live variables is considerable [5].

Reducing the vector register length is certainly a solution to the problems
just outlined. If most applications cannot fully use all elements present in
each vector register, then reducing the vector register length will reduce cost
and increase the fraction of usage of registers. The drawback of register
length reduction is the associated performance penalty. Each time a vector
instruction is executed, its associated latencies are amortized over a smaller
number of elements. This can have a significant negative impact on performance,
especially for memory accesses. Moreover, more instructions have to be
executed, each with a shorter effective length, and, therefore, the number of
times that latencies must be paid is larger.
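
The amortization argument can be made concrete with a toy cost model. The
sketch below is an assumption for illustration only, not the simulator used in
this work: each vector instruction pays a fixed startup latency once and then
delivers one element per cycle, and instructions do not overlap. The function
names and the numbers in the usage note are hypothetical.

```python
import math


def vector_cycles(n_elements: int, vl: int, startup: int) -> int:
    """Cycles to process n_elements with registers of length vl.

    Assumed model: every vector instruction pays `startup` cycles once and
    then produces one element per cycle; instructions are not overlapped.
    """
    instructions = math.ceil(n_elements / vl)
    return instructions * startup + n_elements


def slowdown(n_elements: int, vl_long: int, vl_short: int, startup: int) -> float:
    """Ratio of the short-register time to the long-register time."""
    return (vector_cycles(n_elements, vl_short, startup)
            / vector_cycles(n_elements, vl_long, startup))
```

Under this model, 1024 elements with a 50-cycle startup take 1424 cycles with
128-element registers and 4224 cycles with 16-element registers, since the
startup is paid 64 times instead of 8.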

Unless some extra latency tolerance mechanism is introduced in a vector
architecture, vector length cannot be reduced without a severe performance
penalty. While many techniques have been developed to tolerate memory latency
in superscalar processors, only a few studies have considered the same problem
in the context of vector architectures [3, 4, 5].

In this paper we study the influence of the vector register size on two
different concepts of vector architectures: a conventional vector architecture
and an out-of-order vector machine. We present data confirming that we cannot
reduce the vector register size on a conventional vector architecture without
suffering a severe performance penalty. We show that, when combining
out-of-order execution and short registers, the performance degradation is much
smaller than that observed on a conventional vector machine. We have observed
that this combination not only allows the vector register reduction with good
performance but also, when comparing the performance of both architectures,
makes the performance of the new out-of-order vector machine much better.

Fig. 1. Percentage of full stripes for different vector register sizes


2  Vector  Registers  Usage 

In this section we investigate the relationship between the following two
parameters:

• Vector Register Size (VRZ).

• Benchmark Programs.

High memory latencies are common in vector architectures. In order to hide
that latency, large vector registers have been the norm in the design of this
kind of architecture. This point of view is correct but, unfortunately, with
large vector registers not everything is positive:

• Large registers mean large hardware space and higher cost. Designers normally
include just a few of them (e.g. 8 or 16 with 128 elements each).

• Having few registers is a drawback for the compiler because the quality of
the code that it can generate is quite poor.

We have seen [1] that, when a machine has large registers, programs do not make
full use of this hardware. Many people are researching new algorithms in order
to execute their calculations as fast as possible in physics, chemistry,
mathematics and so on, fields where this kind of architecture still excels. The
characteristics of the algorithms are quite varied and the different
architectures try to apply all their capacity, but sometimes the data
structures of the applications act as a barrier.




To find out how a set of applications makes use of the register file of a vector architecture, we proceeded as follows. Given a set of registers, each holding a maximum of VL elements, we executed our set of programs using four possible values for the vector register size: 16, 32, 64 and 128 elements.

Figure 1 presents the percentage of full stripes for the program set. If an architecture has a maximum VL of 128 and the structure of a program permits the use of all 128 elements of a register, we say that we have a full stripe.

Now, if we consider a maximum vector register size of 64 elements and the program allows the use of larger registers, each such instruction is "translated" into two instructions that operate on 64 elements each. For example, Figure 1 shows that in most cases less than 50% of all executed vector instructions used a full vector register of size 128. When the vector register size was 16 elements, almost 85% of all executed vector instructions used full stripes, except in the program dyfesm.
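The notion of a full stripe can be made concrete with a short sketch. The Python fragment below is only an illustration: the function names and the loop trip counts are hypothetical and are not taken from the benchmarks. It assumes that a loop of n iterations is executed as a sequence of strips of at most VRZ elements, and counts how many of those strips fill the register completely.

```python
def stripe_stats(trip_counts, vrz):
    """Return (full, total) strip counts for loops with the given trip counts.

    Assumption: a loop of n iterations runs as ceil(n / vrz) strips, and every
    strip except possibly the last one fills all vrz elements of the register.
    """
    full = 0
    total = 0
    for n in trip_counts:
        full += n // vrz                              # completely filled strips
        total += n // vrz + (1 if n % vrz else 0)     # plus one partial strip
    return full, total


def full_stripe_percentage(trip_counts, vrz):
    """Percentage of strips that use the whole vector register."""
    full, total = stripe_stats(trip_counts, vrz)
    return 100.0 * full / total if total else 0.0


# Hypothetical trip counts of three vectorized loops (not benchmark data).
loops = [100, 256, 40]
for vrz in (16, 32, 64, 128):
    print(vrz, round(full_stripe_percentage(loops, vrz), 1))
```

For these made-up trip counts the share of full strips grows as the register gets shorter, which is the qualitative trend the text attributes to Figure 1.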

As we expected, there is a strong dependence between overall performance and the program executed to obtain it. We have observed that an architecture with long registers does not guarantee that the applications will make full use of this resource. In most cases (for our applications) register usage is better when the vector registers are smaller.

We know from [6] that a reduction of the vector register length on a conventional vector architecture must be accompanied by a technique that hides that reduction, in order to keep or, in the best case, improve performance.

3 Reducing Vector Register Length

The architecture and the compiler are both reflected in the characteristics of the code generated from an application. If they form an intelligent pair, it can be easy to obtain programs that use different vector register sizes by dividing a register into sections, where each section can be considered an independent register. The Fujitsu VPP500 [7] is an example of this kind of architecture. The VPP500 has a vector register file organized as 256 registers of 64 elements each (8 bytes per element). Different register file configurations are possible, from 256 registers of 64 elements each to 8 registers of 2048 elements each. For our purposes, this lower limit (64 elements) is not small enough, because we want to study shorter vector registers in order to obtain better register usage (see section 2).

Unfortunately, most vector architectures do not offer the vector register reorganization of the VPP500. Our reference architecture falls into this category.

The procedure that we followed to obtain a set of binaries (from the benchmarks) assuming different vector register lengths is the following:

• For each program, we searched for all the highly vectorized loops, with the help of the compiler information.
      DO 40 J=2,JL
      DO 40 I=2,IL
      DW(I,J,1) = DW(I,J,1) + FW(I,J,1)
      DW(I,J,2) = DW(I,J,2) + FW(I,J,2)
      DW(I,J,3) = DW(I,J,3) + FW(I,J,3)
      DW(I,J,4) = DW(I,J,4) + FW(I,J,4)
   40 CONTINUE

                    (a)

      DO 40 J=2,JL
      DO 40 STRIPV=2,IL,VLZ
C$DIR MAX_TRIPS(32)
C     Each strip covers at most VLZ iterations of the original I loop
      DO 40 I=STRIPV,MIN(IL,STRIPV+VLZ-1)
      DW(I,J,1) = DW(I,J,1) + FW(I,J,1)
      DW(I,J,2) = DW(I,J,2) + FW(I,J,2)
      DW(I,J,3) = DW(I,J,3) + FW(I,J,3)
      DW(I,J,4) = DW(I,J,4) + FW(I,J,4)
   40 CONTINUE

                    (b)


Fig. 2. (a) Flo52 loop without strip-mining, (b) adding strip-mining.


• We then manually modified the benchmark sources, adding a strip-mined loop (see figure 2) that performs steps of the desired length VLZ (vector length size).

• In this way, we constructed four different configurations of each source program, using VLZ = 16, 32, 64 and 128 elements per register.
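The strip-mining transformation of figure 2 can be sketched in Python as follows. This is a minimal illustration, not the procedure applied to the benchmarks: the names `strip_mine` and `add_strip_mined` are hypothetical, the arrays are one-dimensional for brevity, and each strip is assumed to end at `start + vlz - 1` so that consecutive strips do not overlap.

```python
def strip_mine(lo, hi, vlz):
    """Yield inclusive (start, end) index ranges of at most vlz iterations."""
    for start in range(lo, hi + 1, vlz):
        yield start, min(hi, start + vlz - 1)


def add_strip_mined(dw, fw, lo, hi, vlz):
    """Compute dw[i] += fw[i] for lo <= i <= hi, one strip at a time."""
    for start, end in strip_mine(lo, hi, vlz):
        # Each strip stands for one vector instruction of length <= vlz.
        for i in range(start, end + 1):
            dw[i] += fw[i]
    return dw


# Indices 2..10 processed with a (hypothetical) vector length of 4.
print(list(strip_mine(2, 10, 4)))
```

The outer generator plays the role of the STRIPV loop, and the inner loop corresponds to the short vector operation that the compiler generates for each strip.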

After applying this technique, the architecture sees more scalar and vector instructions. The vectorizable loop needs more iterations to complete the same number of vector operations and, because the scalar operations are inside the loop, they are executed more times.

In the next section we describe the vector architectures examined in this study, and then we show the performance reached by each one.

4 Vector Architectures and Simulation Tools

In  this  section,  we  describe  the  main  characteristics  of  the  architectures  evaluated 
in  this  work.  First,  we  will  show  the  reference  vector  architecture  used  as  a 
baseline.  Second,  we  will  introduce  the  out-of-order  vector  architecture  used. 
Finally,  we  will  describe  the  tools  used  to  generate  traces  and  for  simulating 
each  architecture. 


4.1  The  Baseline  Architecture 

We have used a machine loosely based on a Convex C3400 [8] as the baseline vector architecture. Although this machine is a multiprocessor architecture, our work assumes a uniprocessor vector machine.

Figure 3 shows a basic description of a C3400.

• Scalar Unit

- The scalar unit executes all instructions that involve scalar registers (A and S registers), such as subtracts, compares, shifts, logical operations and integer conversions. It issues a maximum of one instruction per cycle.

- The scalar unit has eight 32-bit address registers and eight 64-bit scalar registers.

- This unit has a [...] Kb data cache, with a 32-byte line size.

Fig. 3. The reference vector architecture.


•  Vector  Unit 

- The vector unit consists of two computation units (FU1 and FU2) and one memory access unit. FU2 is a general-purpose arithmetic unit capable of executing all vector instructions. FU1 is a restricted functional unit that executes all vector instructions except multiplication, division and square root.

- The vector unit has 8 vector registers, grouped in pairs. Each register holds up to 128 elements of 64 bits each. Each pair shares two read ports and one write port, which link it to the functional units.

• Memory requests (loads and stores) go through a single data bus.

• The reference machine implements vector chaining from functional units to other functional units and to the store unit. Memory loads do not chain with any functional unit.


4.2  The  Out-of-order  Vector  Architecture 


For our simulations we used the out-of-order vector architecture introduced in [5]. The out-of-order, renaming version of the reference architecture is shown in figure 4. It has the same computing capacity as the reference machine, but it is extended with a renaming technique very similar to that found in the R10000 [9]. We will refer to this architecture as 'OOO'. Instructions flow in order through the Fetch and Decode/Rename stages and then go to one of the four queues present in the architecture, based on instruction type. At the rename stage, a mapping table translates each virtual register into a physical register. There are 4 independent mapping tables, one for each type of register: A, S, V and mask registers. Each mapping table has its own associated list of free registers. When instructions are accepted into the decode stage, a slot in the reorder buffer is also allocated. Instructions enter and exit the reorder buffer in strict program order. When an instruction defines a new logical register, a physical register is taken from the free list, the mapping table entry is updated with the new physical register number, and the old mapping is stored in the reorder buffer slot allocated to the instruction. When the instruction commits, the old physical register is returned to the free list.

Fig. 4. The out-of-order vector architecture studied in this paper.
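The interplay between a mapping table, its free list and the reorder buffer can be illustrated with a toy model. The sketch below is an assumption-laden simplification and not the simulator used in this work: the class `RenameUnit` is hypothetical, it models a single register type, it assumes logical register i is initially mapped to physical register i, and it ignores branch misprediction recovery.

```python
from collections import deque


class RenameUnit:
    """Toy model of one mapping table with its free list and reorder buffer."""

    def __init__(self, n_logical, n_physical):
        # Assumption: logical register i starts mapped to physical register i.
        self.map = list(range(n_logical))
        self.free = deque(range(n_logical, n_physical))
        self.rob = deque()  # one entry per instruction: old mapping or None

    def rename(self, dest=None):
        """Rename one instruction; dest is the logical register it defines."""
        if dest is None:
            self.rob.append(None)            # instruction defines no register
            return None
        if not self.free:
            raise RuntimeError("no free physical register: rename must stall")
        new_phys = self.free.popleft()
        self.rob.append(self.map[dest])      # old mapping kept in the ROB slot
        self.map[dest] = new_phys
        return new_phys

    def commit(self):
        """Commit the oldest instruction and recycle its old register."""
        old_phys = self.rob.popleft()
        if old_phys is not None:
            self.free.append(old_phys)
        return old_phys


unit = RenameUnit(n_logical=8, n_physical=16)
first = unit.rename(dest=3)    # first definition of logical register 3
second = unit.rename(dest=3)   # second definition gets another register
print(first, second, unit.commit())
```

Two successive writes to the same logical register receive different physical registers, so the second write does not have to wait for readers of the first value; the old register returns to the free list only at commit.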

The A, S and V queues monitor the ready status of all instructions they hold and, as soon as an instruction is ready, it is sent to the appropriate functional unit for execution. Each instruction queue can hold up to 16 instructions. The machine has a 64-entry BTB, where each entry has a 2-bit saturating counter for predicting the outcome of branches. Both scalar register files (A and S) have 64 physical registers each. The mask register file has 8 physical registers. The fetch stage, the decode stage and all four queues process a maximum of 1 instruction per cycle. Committing instructions proceeds at a faster rate, and up to 4 instructions may commit per cycle. The functional unit latencies of the architecture are very similar to those of the R10000. See [5] for further details of the architecture.
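As a reminder of how a 2-bit saturating counter predicts a branch, the following sketch models a 64-entry table. Several details are assumptions made only for illustration and are not stated in the text: the initial counter state, the rule "predict taken when the counter is 2 or 3", and the indexing of the table by the branch address modulo the table size.

```python
class TwoBitCounter:
    """2-bit saturating counter with states 0..3."""

    def __init__(self, state=1):
        self.state = state              # assumed initial state: weakly not taken

    def predict(self):
        return self.state >= 2          # states 2 and 3 predict "taken"

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)   # saturate at 3
        else:
            self.state = max(0, self.state - 1)   # saturate at 0


class BranchPredictor:
    """Table of counters indexed by branch address (hypothetical indexing)."""

    def __init__(self, entries=64):
        self.table = [TwoBitCounter() for _ in range(entries)]

    def _entry(self, pc):
        return self.table[pc % len(self.table)]

    def predict(self, pc):
        return self._entry(pc).predict()

    def update(self, pc, taken):
        self._entry(pc).update(taken)


bp = BranchPredictor()
for outcome in (True, True, False, True):
    print(bp.predict(100), outcome)
    bp.update(100, outcome)
```

Because the counter saturates, a single not-taken outcome in a loop that is usually taken does not flip the prediction.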

The most important aspect of the architecture when considering final performance is the number of physical vector registers available for renaming vector instructions. In [5] it is shown that 16 physical vector registers is the optimum point that maximizes performance at a reasonable cost. Unless otherwise stated, we will use 16 physical vector registers for our simulations. In section 5, we will vary the number of physical vector registers from 16 to 32 and to 64 to study how the number of physical registers interacts with the length of each register.

As we did for the traditional machine, we define four different versions of the OOO architecture, each having a different vector register length. The four versions will be referred to as the OOO128, OOO64, OOO32 and OOO16 architectures and have a vector length of 128, 64, 32 and 16 elements respectively.


4.3 Simulation Tools

For our simulations, we have used trace-driven simulation to generate all the data that we show.

We have used a pixie-like tool called Dixie [10], which is able to produce a trace of the basic blocks executed as well as a trace of the values contained in the vector length (vl) register, and Jinks [11], a parameterizable simulator that implements the reference architecture model described above. The ability to trace the value of the vector length register is critical for a detailed simulation of the program execution.


5  Performance 

Using the binaries obtained as described in section 3, we study different variations of our vector architectures. For each program, we have eight different configurations, which differ in the maximum vector register size that the program is allowed to use. The eight models under study will be referred to as REF128, REF64, REF32, REF16, OOO128, OOO64, OOO32 and OOO16, where 128, 64, 32 and 16 are the vector register sizes used by each model.

Both architectures have the same number of logical registers, which means that the same code was run on both architectures. However, because the out-of-order architecture implements renaming, it uses a total of 16 physical registers, which are invisible to the compiler and to the user.

We cover two points in this section. For three different memory latencies of 1, 50 and 100 cycles, we show:

• How each architecture tolerates the combined effect of vector register reduction and memory latency.

• The performance (speedup) of each architecture, using different vector register sizes.

5.1  Reference  Architecture 

In  Figure  5,  we  can  see  the  effect  of  reducing  vector  register  sizes  on  the  reference 
vector  architecture. 

In this figure, we have selected REF128 as the baseline in order to study the effect of register reduction. Using a latency of one cycle and register sizes of 128 and 64, the behavior seems to be constant, as in an ideal vector architecture. When we reduce the register size and use more realistic memory latencies of 50 and 100 cycles, the effect is clearly negative. It is most noticeable when the memory latency is larger and the vector registers are shorter.

Fig. 5. Effects of memory latency and vector register size on a Conventional Vector Architecture. X-axis is memory latency.

Even when the architecture uses large registers (REF128), the performance degradation is significant. The slowdown can take values from 3.1-1.7. This is an important point to emphasize, because large registers have been one of the best tools used on vector architectures to attack memory latency, but we can see that they are not sufficient.

If we compare REF128 with the other configurations (REF64, REF32 and REF16), the slowdown can reach up to 3.5.

This behavior is not a surprise: as we expected, reducing the vector register size on a conventional vector architecture can have a strongly negative effect.

5.2  OOO  Vector  Architecture 

Figure 6 shows the effect of vector register reduction, now on the out-of-order vector architecture. Again, the baseline is the best configuration, in this case OOO128. We can clearly observe that this architecture tolerates vector register reduction better. When the vector register size is reduced to 1/4 (from 128 to 32 elements, line OOO32), the execution time is degraded by a factor of 1.0-1.5.

When we evaluated the effect of memory latency, we saw that OOO128, OOO64 and OOO32 in most cases (programs swm256, hydro2d, arc2d, nasa7, tomcatv and bdna) have very good memory latency tolerance, with slowdowns in the range 1.0-1.3. Other programs, such as flo52, trfd, dyfesm and su2cor, do not behave well with short registers, but their tolerance is still better than that shown by the reference architecture, with slowdowns in the range 1.22-1.98.

Fig. 6. Effects of memory latency and vector register length on an Out-of-order Vector Architecture. X-axis is memory latency.


At this point we can conclude that an architecture that uses advanced ILP techniques such as out-of-order execution is able to tolerate vector register reduction better, even across a large range of latencies.


5.3  Performance  Comparison 

In this section we present a performance comparison between both architectures. We make this comparison using the same or a smaller register file size, that is, REF128 versus OOO16, OOO32 and OOO64.

Figure 7 plots the simulated performance using three different memory latencies. For each program, each configuration and each value of memory latency, we compute the speedup relative to the performance of the REF128 configuration at latency 1.

We can observe in Figure 7 that, using the same register file size of 8 Kbytes (lines REF128 and OOO64), the performance of OOO64 is much better than that of REF128 for all the programs and all the memory latencies, with speedups in the range of 1.09-1.4.

Even when the register file size of OOO64 is halved (line OOO32), the out-of-order machine is still better than the reference machine with large registers, for all programs and all memory latencies, with speedups in the range of 1.04-1.34.
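The register file sizes compared in this section follow directly from the parameters given in section 4: 8 vector registers of 128 elements for the reference machine, 16 physical vector registers for the out-of-order machine, and 64-bit elements. The short sketch below only restates that arithmetic; the function and dictionary names are illustrative.

```python
ELEMENT_BYTES = 8  # each vector element is 64 bits wide


def register_file_bytes(n_registers, elements_per_register):
    """Total storage of a vector register file, in bytes."""
    return n_registers * elements_per_register * ELEMENT_BYTES


configs = {
    "REF128": (8, 128),   # 8 architected registers, no renaming
    "OOO64": (16, 64),    # 16 physical registers of 64 elements
    "OOO32": (16, 32),
    "OOO16": (16, 16),
}
sizes = {name: register_file_bytes(*cfg) for name, cfg in configs.items()}
for name, size in sizes.items():
    print(f"{name}: {size // 1024} Kbytes")
```

REF128 and OOO64 therefore occupy the same storage, while OOO32 and OOO16 use one half and one quarter of it respectively.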

Nevertheless, when the register file size of OOO64 is reduced to 1/4 (line OOO16), the performance of the out-of-order machine is not always better than that of REF128. The programs hydro2d, flo52, tomcatv, bdna and trfd show better performance than the reference machine, with speedups in the range of 1.03-1.1. A further group of programs, including swm256, arc2d and nasa7, have performance that is slightly better or slightly worse than REF128, but the difference is typically around 8%. Finally, the worst case was the program su2cor, with a slowdown of around 40%.

Fig. 7. Performance comparison of the OOO architecture and the Reference Architecture using the same or a smaller register file size. X-axis is memory latency in cycles and Y-axis represents speedup.


6  Summary 

In this paper we have studied the influence of reducing the vector register size on two different kinds of vector architecture.

In-order execution, traditionally used in vector architectures, and the long latencies paid on memory requests have always been paired with long vector registers in order to hide and amortize this latency and this strict program order. Nevertheless, we have shown that long registers were rarely fully used by a set of highly vectorizable programs. Less than 40% of all the registers being used are completely filled with 128 elements of data.

As expected, reducing the vector register length on a traditional vector machine results in a considerable loss of performance. The cost savings are clearly outweighed by the execution time degradation. Halving the vector length yields slowdowns in the range of 1.1-3.5. Unless some latency tolerance technique is added to a traditional vector machine, vector register length should be kept as long as possible.

We have used an ILP technique, out-of-order execution, to reduce the need for very large vector registers without a significant loss of performance. Simulations show that when out-of-order execution is exploited, it is possible to reduce the vector register size to 1/4 without a considerable degradation in performance (slowdowns of 1.0-1.5).

Finally, we have compared the performance of both architectures, where the out-of-order vector architecture used the same or a smaller register file size than the baseline architecture. Simulations showed that, using out-of-order execution, it is possible to reduce the vector register length to 32 elements (one quarter of REF128, a 4-Kbyte register file) with better performance (speedups of 1.04-1.34) than the conventional architecture, and to 16 elements (one eighth of REF128, a 2-Kbyte register file) with speedups in the range 0.9-1.3.

With this work we have shown that, when ILP is exploited using an out-of-order architecture, the need for very large vector registers that we noted in our previous studies is substantially reduced. The vector register reduction can be used in several different ways: either to decrease processor cost by reducing the total amount of storage devoted to register values, or to improve performance by using the available storage more effectively. With out-of-order execution and short registers, the perception of vector architectures as big and expensive supercomputers could change, because designers could use current technology and ideas (caches, memory systems, non-blocking loads, clustering, etc.) to improve performance.
Abstract. The ORTHOMIN(k) algorithm, a truncated version of the GCR (generalized conjugate residual) algorithm proposed by Eisenstat et al. [4], has been widely used for solving large and sparse nonsymmetric linear systems of equations Ax = b. In order to accelerate the convergence of the ORTHOMIN(k) method, a restart technique is generally used. However, it is not easy to determine when the algorithm should be restarted. In this paper, we propose a new adaptive restarted procedure which finds the restart timing of ORTHOMIN(k) automatically. Finally, numerical experiments are reported that demonstrate the efficacy of the adaptive restarted procedure combined with the ORTHOMIN(k) algorithm on a distributed memory parallel machine, the AP1000.


1  Introduction 

In  this  paper,  we  consider  the  iterative  solution  of  large  and  sparse  linear  systems 
of  equations 

Ax  =  b  (1) 

in which the coefficient matrix A is a non-singular n x n matrix and b is a given n-vector. To simplify the presentation, we assume that A and b are real and that A is large and nonsymmetric. The class of non-stationary iterative methods is characterized by the fact that the update for the residual vector is computed separately from the current approximation to the solution. A major class of these methods is formed by the Krylov subspace or conjugate gradient type algorithms, such as GCR (generalized conjugate residual) [4], GMRES (generalized minimal residual) [5], BiCG (bi-conjugate gradient) [2], and BiCGStab (bi-conjugate gradient stabilized) [8, 11, 17, 18].

The ORTHOMIN(k) algorithm [1] is an important variant of the GCR algorithm [4]. Under certain conditions it converges very quickly compared with the other members of the GCR family. However, in some cases the residual of the ORTHOMIN(k) algorithm does not converge quickly. We therefore present an adaptive restarted procedure for the ORTHOMIN(k) algorithm, so that the combined algorithm achieves faster convergence. The adaptive restarted procedure was originally proposed by Inadu and Nodera [13, 16] for the PRES (pseudo residual) [9] algorithm. In this paper, the adaptive restarted procedure for the ORTHOMIN(k) algorithm is proposed, and it is shown to decide the restart timing automatically.

This paper is organized as follows. In section 2, we briefly review the ORTHOMIN(k) algorithm and its associated properties. In section 3, we present the main idea on which the adaptive restarted procedure for the ORTHOMIN(k) algorithm is based. In section 4, we report numerical experiments that show the convergence behavior of the adaptive restarted ORTHOMIN(k) algorithm on the MIMD parallel machine AP1000, followed by some concluding remarks in section 5.

2 Review of ORTHOMIN(k)

One of the most successful classes of schemes is based on orthogonal projection, typified by the GCR (generalized conjugate residual) [4], ORTHOMIN [1, 4], ORTHODIR [3, 11], ORTHORES [9, 12] and GMRES [5] algorithms. The GCR algorithm is mathematically equivalent to the GMRES algorithm. The GCR algorithm begins with the initial approximate solution $x_0$ and initial residual $r_0 = b - A x_0$, and characterizes the $k$th approximate solution as $x_k = x_0 + z_k$, where $z_k$ solves

$$\min_{z \in \mathcal{K}_k} \| b - A(x_0 + z) \|_2 = \min_{z \in \mathcal{K}_k} \| r_0 - A z \|_2 .$$

Here, $\mathcal{K}_k$ is the $k$th Krylov subspace determined by the coefficient matrix $A$ and $r_0$, which is defined as

$$\mathcal{K}_k = \mathrm{span}\{ r_0, A r_0, A^2 r_0, \ldots, A^{k-1} r_0 \}.$$

In some sense, the GCR algorithm finds the best approximate solution in the Krylov subspace. In contrast to the BiCG-like algorithms [11, 8, 17] based on the Lanczos process, the GCR algorithm uses long recurrences. The work and storage per step therefore grow drastically as the number of steps increases, and the algorithm becomes impractical for a large number of iterations. As a consequence, we must restart the algorithm in practice, which may result in very slow convergence. In order to overcome this disadvantage of the long recurrences, a popular technique is to resort to truncation strategies, which use only a few, say k, of the previously generated vectors in the recurrences that produce the next vectors, and can be significantly less expensive at each step.

The ORTHOMIN(k) algorithm was originally proposed by Vinsome [1] as a truncated version of the GCR algorithm. Figure 1 displays the standard ORTHOMIN(k) algorithm without correction. In this algorithm, the direction vector update is truncated so that at most k previous direction vectors (with k smaller than n) are used after iteration k:

$$p_i = r_i + \sum_{j=i-k}^{i-1} \beta_j^{(i)} p_j. \qquad (2)$$

In this case, $x_{i+1}$ is a local minimum: it is the point in

$$x_{i-k} + \mathrm{span}\{ p_{i-k}, \ldots, p_i \}$$

whose residual norm $\| r_{i+1} \|_2$ is minimized.

1: Choose $x_0$.
2: $r_0 = b - A x_0$
3: for $i = 0, 1, 2, \ldots$
   3.1: if $i = 0$ then
      3.1.1: $p_0 = r_0$
        else
      3.1.2: for $j = \sigma, \sigma + 1, \ldots, i - 1$
         3.1.2.1: $\beta_j^{(i)} = -(A r_i, A p_j) / (A p_j, A p_j)$
             endfor
      3.1.3: $p_i = r_i + \sum_{j=\sigma}^{i-1} \beta_j^{(i)} p_j$
        endif
   3.2: $\alpha_i = (r_i, A p_i) / (A p_i, A p_i)$
   3.3: $x_{i+1} = x_i + \alpha_i p_i$
   3.4: $r_{i+1} = r_i - \alpha_i A p_i$
   3.5: If converge, escape the loop.
   endfor

Where $\sigma = \max\{0, i - k\}$.


Fig. 1. The ORTHOMIN(k) algorithm
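The iteration of Fig. 1 can be sketched in a few lines of Python. This is a minimal dense, pure standard-library illustration and not the parallel implementation evaluated on the AP1000. The function names, the zero initial guess, the relative residual stopping test and the small tridiagonal test matrix are all assumptions made for the example. The sketch also forms $A p_i$ from $A r_i$ and the stored $A p_j$ by linearity, which is algebraically equivalent to step 3.1.3 followed by a product with $A$.

```python
import math


def matvec(a, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def axpy(alpha, x, y):
    """Return y + alpha * x."""
    return [yi + alpha * xi for xi, yi in zip(x, y)]


def orthomin(a, b, k, tol=1e-10, max_iter=500):
    """ORTHOMIN(k) for A x = b with x_0 = 0; returns (x, iterations)."""
    x = [0.0] * len(b)
    r = list(b)                               # r_0 = b - A x_0 with x_0 = 0
    b_norm = math.sqrt(dot(b, b)) or 1.0
    dirs = []                                 # last k pairs (p_j, A p_j)
    for i in range(max_iter):
        if math.sqrt(dot(r, r)) <= tol * b_norm:
            return x, i                       # step 3.5, tested before the update
        ar = matvec(a, r)
        p, ap = list(r), list(ar)
        for pj, apj in dirs:                  # steps 3.1.2 and 3.1.3
            beta = -dot(ar, apj) / dot(apj, apj)
            p = axpy(beta, pj, p)
            ap = axpy(beta, apj, ap)          # A p_i obtained by linearity
        alpha = dot(r, ap) / dot(ap, ap)      # step 3.2
        x = axpy(alpha, p, x)                 # step 3.3
        r = axpy(-alpha, ap, r)               # step 3.4
        dirs.append((p, ap))
        if len(dirs) > k:
            dirs.pop(0)                       # keep only the k newest directions
    return x, max_iter


def example_system(n):
    """Hypothetical nonsymmetric tridiagonal test problem with solution (1,...,1)."""
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        a[i][i] = 4.0
        if i + 1 < n:
            a[i][i + 1] = -0.5
            a[i + 1][i] = -1.5
    return a, matvec(a, [1.0] * n)


a, b = example_system(20)
x, its = orthomin(a, b, k=3)
print(its, max(abs(xi - 1.0) for xi in x))
```

The test matrix is chosen so that its symmetric part is positive definite, the situation covered by the convergence theorem of Eisenstat et al. [4].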

The following theorem was given by Eisenstat et al. [4].


[Theorem 2.1] Let $M = (A + A^T)/2$ denote the symmetric part of $A$, and $R = (A - A^T)/2$ denote the skew-symmetric part of $A$. When $M$ is positive definite, the residuals generated by the ORTHOMIN(k) method satisfy the following relation:

$$\| r_i \|_2 \le \left[ 1 - \frac{\lambda_{\min}(M)^2}{\lambda_{\min}(M)\,\lambda_{\max}(M) + \rho(R)^2} \right]^{i/2} \| r_0 \|_2 ,$$

where $\lambda_{\min}(M)$ and $\lambda_{\max}(M)$ denote the smallest and largest eigenvalues of $M$, respectively, and $\rho(R)$ denotes the spectral radius of $R$.

This theorem states that the residual norm of the ORTHOMIN(k) algorithm decreases at every iteration step, so that the algorithm yields an approximate solution. In practice, we have found that even though this bound is pessimistic, the algorithm is an effective solution technique for large and sparse nonsymmetric matrix problems. The algorithm is very easy to implement, but in some cases the convergence of the residual norm of ORTHOMIN(k) slows down. In this case, we choose a new starting vector and restart the algorithm, so a suitable restart is usually necessary for this algorithm to accelerate the convergence of the residual. In the next section, we study the automatic, adaptive restart of the ORTHOMIN(k) algorithm.
3 The adaptive restarted procedure

Restarts of the ORTHOMIN(k) algorithm are ordinarily needed to reduce round-off errors and the computational time required to satisfy the convergence criterion. However, too many restarts slow down the convergence of the ORTHOMIN(k) algorithm, whereas suitably timed restarts can accelerate the convergence of the residuals. We have therefore designed an adaptive procedure with automatic restart for the ORTHOMIN(k) algorithm.

The adaptive restarted procedure, which was proposed by Inadu and Nodera [13, 16], is a technique that was introduced for the ORTHORES(k) algorithm for solving large sparse nonsymmetric linear systems of equations. The ORTHORES(k) algorithm belongs to the class of pseudo residual algorithms [9, 10]. This technique improves the convergence of the ORTHORES(k) method by restarting the algorithm appropriately. To make this approach work effectively, we need to determine when to perform the restart. For the pseudo residual algorithm, we decided the timing of the restart from two points of view: one is the observation of an oscillating residual norm, and the other is the observation of the scalar coefficients of the ORTHORES(k) algorithm. The ORTHOMIN(k) algorithm, on the other hand, has the good property that it minimizes the residual norm. Therefore, we use a different strategy to determine the restart timing for the ORTHOMIN(k) algorithm adaptively.

The ORTHOMIN(k) algorithm converges very slowly when the scalar $|\alpha_i|$ is sufficiently small. One of the reasons is that the degree of the direction polynomials does not increase. So, in order to improve the convergence of the residual, we base the timing of the restart of the ORTHOMIN(k) algorithm on the scalar $|\alpha_i|$. The scalar $|\alpha_i|$ can also be interpreted as the distance travelled along a direction vector. In fact, we have the property that, while the norm of the residual is decreasing sharply, the scalar $|\alpha_i|$ stalls at a small value. Let us consider the execution of the adaptive restart with the following rule for determining its timing.

(1) Rule of deciding the timing of restart

When the scalar $\|\alpha_i A p_i\| / \|r_i\|$ remains smaller than a parameter $\varepsilon$ given in advance for k consecutive iteration steps, we cannot expect fast convergence of the ORTHOMIN(k) method, and we then consider performing a restart. A restart is less likely to be executed for smaller values of the parameter $\varepsilon$ and more likely for larger values. Numerous experiments, on problems arising from the discretization of boundary value problems of partial differential equations among others, have shown that the adaptive restarted procedure is stable for many problems with the parameter around $\varepsilon = 1.0$.

In the iteration steps of the ORTHOMIN(k) method that follow a restart, we expect the scalar $\|\alpha_i A p_i\| / \|r_i\|$ to take larger values. If, however, the restart was performed tentatively at a small value of $\|\alpha_i A p_i\| / \|r_i\|$ and the convergence of the ORTHOMIN(k) method is still slow afterwards, the situation has become worse than if the residual polynomial had been retained, and it would have been better not to perform the restart. However, knowing the scalar $\|\alpha_i A p_i\| / \|r_i\|$ after a restart in advance would require additional computational cost equal to that of the iteration steps themselves, which would not be efficient. Therefore, we perform the first restart unconditionally, and from the second restart onwards we decide whether or not to restart according to the outcome of the previous restart.

1: Choose $x_0$ and $\varepsilon$.
2: $\alpha'_{\max} = 0$, adapt_restart := on ........ (2-a)
3: $r_0 = b - A x_0$, k_count = 0
4: for $i = 0, 1, 2, \ldots$
   4.1: Calculate $\alpha_i$ and $p_i$, using ORTHOMIN(k) method.
   4.2: $x_{i+1} = x_i + \alpha_i p_i$
   4.3: $r_{i+1} = r_i - \alpha_i A p_i$
   4.4: If converge, escape the loop.
   4.5: $\alpha'_i = \|\alpha_i A p_i\| / \|r_i\|$
   4.6: If $\alpha'_i > \alpha'_{\max}$, then adapt_restart := on ........ (2-b)
   4.7: if $\alpha'_i < \varepsilon$ then
      4.7.1: k_count = k_count + 1
        else
      4.7.2: k_count = 0, adapt_restart := on ........ (2-d)
        endif
   4.8: if k_count = k then ........ (1)
      4.8.1: if adapt_restart = on then ........ (2-c)
         4.8.1.1: $\alpha'_{\max} = \max_j \alpha'_j$
         4.8.1.2: adapt_restart := off
         4.8.1.3: $x_0 = x_{i+1}$ and restart (goto step 3).
           endif
        endif
   endfor


Fig. 2. The algorithm of the adaptive restarted procedure for the ORTHOMIN(k) method (AR-ORTHOMIN(k))

(2)  Rule  of  the  execution  of  restart 

(a) The first restart is performed unconditionally.

(b) When a restart has been performed, we examine its effectiveness by comparing the maximum distance advanced in the $k$ iteration steps before the restart with that after the restart. If the maximum distance advanced after the restart is larger, we consider that the restart is working effectively.
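
The restart rules can be made concrete with a short sketch. The Python code below is a minimal, unoptimized illustration of the procedure of Fig. 2 under stated assumptions; it is not the authors' parallel implementation. It assumes the standard ORTHOMIN(k) formulas ($\alpha_i = (r_i, Ap_i)/(Ap_i, Ap_i)$ and $\beta_j = -(Ar_{i+1}, Ap_j)/(Ap_j, Ap_j)$ over the last $k$ directions), and it takes $\alpha_{max}$ to be the largest value of $\|\alpha_i A p_i\|/\|r_i\|$ over the $k$ steps preceding a restart. The names `ar_orthomin` and `tridiag_matvec`, and the small tridiagonal test matrix, are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def ar_orthomin(matvec, b, x0, k=5, eps=1.0, tol=1e-10, max_iter=2000):
    """ORTHOMIN(k) with the adaptive restart rule of Fig. 2 (illustrative sketch)."""
    x = list(x0)
    alpha_max, adapt_restart = 0.0, True                  # step 2, rule (2-a)
    r = [bi - ci for bi, ci in zip(b, matvec(x))]         # step 3
    r0_norm = math.sqrt(dot(r, r))
    p, Ap = list(r), matvec(r)
    window, recent, k_count, restarts = [], [], 0, 0
    for it in range(1, max_iter + 1):
        ApAp = dot(Ap, Ap)
        alpha = dot(r, Ap) / ApAp                         # step 4.1
        r_norm = math.sqrt(dot(r, r))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]     # step 4.2
        r = [ri - alpha * qi for ri, qi in zip(r, Ap)]    # step 4.3
        if math.sqrt(dot(r, r)) <= tol * r0_norm:         # step 4.4
            return x, it, restarts
        a_prime = abs(alpha) * math.sqrt(ApAp) / r_norm   # step 4.5
        recent = (recent + [a_prime])[-k:]
        if a_prime > alpha_max:                           # step 4.6, rule (2-b)
            adapt_restart = True
        if a_prime < eps:                                 # step 4.7
            k_count += 1
        else:
            k_count, adapt_restart = 0, True              # rule (2-d)
        if k_count == k and adapt_restart:                # step 4.8, rules (1), (2-c)
            alpha_max = max(recent)                       # assumed reading of 4.8.1.1
            adapt_restart = False
            restarts += 1
            r = [bi - ci for bi, ci in zip(b, matvec(x))] # restart: back to step 3
            p, Ap = list(r), matvec(r)
            window, k_count = [], 0
            continue
        # Standard ORTHOMIN(k) update: make A p_{i+1} orthogonal to the last k A p_j.
        window = (window + [(p, Ap, ApAp)])[-k:]
        Ar = matvec(r)
        p, Ap = list(r), list(Ar)
        for pj, Apj, nj in window:
            beta = -dot(Ar, Apj) / nj
            p = [pi + beta * v for pi, v in zip(p, pj)]
            Ap = [qi + beta * v for qi, v in zip(Ap, Apj)]
    return x, max_iter, restarts

def tridiag_matvec(v, lower=-1.5, diag=4.0, upper=-0.5):
    """Hypothetical nonsymmetric test operator with positive definite symmetric part."""
    n = len(v)
    return [(lower * v[i - 1] if i > 0 else 0.0) + diag * v[i]
            + (upper * v[i + 1] if i < n - 1 else 0.0) for i in range(n)]

b = tridiag_matvec([1.0] * 50)          # right-hand side whose solution is all ones
x, iters, restarts = ar_orthomin(tridiag_matvec, b, [0.0] * 50)
```

Note that the restart test reuses $\alpha_i$ and $Ap_i$, which ORTHOMIN(k) computes anyway; the only extra work per step in this sketch is the norm of the previous residual.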
Table 1. AP1000 specification

  Architecture               Distributed Memory, MIMD
  Number of processors       64
  Inter processor networks   Broadcast network (50MB/s)
                             Two-dimensional torus network (25MB/s/port)
                             Synchronization network

4  Numerical  experiments 

We now give some numerical results to demonstrate the convergence behavior of the AR-ORTHOMIN(k) algorithm. We use test problems arising from boundary value problems of partial differential equations in scientific and industrial applications, and we show the efficiency of the adaptive restarted procedure. All the computations were done in double precision (64 bits) on the MIMD parallel machine Fujitsu AP1000 with 64 processors. The specification of the AP1000 is given in Table 1. Each cell of the AP1000 employs a RISC-type SPARC or SuperSPARC processor chip. For simplicity we did not use any preconditioner in the numerical experiments.

[Example 1] Firstly, we consider a finite difference problem, namely, central finite differencing applied to the following Dirichlet problem:

$$-u_{xx} - u_{yy} + \sigma u_x(x,y) + \tau u_y(x,y) = f(x,y) \quad \text{on } \Omega = [0,1]^2,$$
$$u|_{\partial\Omega} = 1 + xy,$$

where $f(x,y)$ is chosen so that the true solution is $u(x,y) = 1 + xy$ on $\Omega$. Let $h$ represent the mesh size in each direction. This yields a matrix of size $n = 66536$ (where $h = 1/257$), after boundary points have been eliminated. In our numerical computations, the initial guess is chosen as $x_0 = 0$, and an approximate solution $x_k$ is considered to have converged if the residual satisfies $\|r_k\|_2/\|r_0\|_2 <$
10  Also,  the  iteration  was  stopped,  when  the  number  of  iteration  exceeded 
6654 ($\approx 0.1 \times n$). By varying the constants $\sigma$ and $\tau$, the degree of nonsymmetry of the coefficient matrix $A$ may be varied.
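
To illustrate how $\sigma$ and $\tau$ enter the coefficient matrix, the following Python sketch assembles the central-difference equations for the problem above on a small grid. It is only an illustration under assumptions: the function name `assemble`, the row-wise dictionary storage, the scaling of each equation by $h^2$, and the small grid size are choices made here and are not taken from the paper.

```python
def assemble(m, sigma, tau):
    """Central differences for -u_xx - u_yy + sigma*u_x + tau*u_y = f on an
    m x m interior grid of the unit square; each equation is scaled by h^2."""
    h = 1.0 / (m + 1)
    g = lambda x, y: 1.0 + x * y              # boundary data (and true solution)
    f = lambda x, y: sigma * y + tau * x      # u_xx = u_yy = 0 for u = 1 + xy
    rows, rhs = [], []
    for j in range(m):
        for i in range(m):
            x, y = (i + 1) * h, (j + 1) * h
            row = {j * m + i: 4.0}
            b = h * h * f(x, y)
            # (neighbour i, neighbour j, stencil coefficient)
            stencil = [
                (i + 1, j, -1.0 + sigma * h / 2),   # east
                (i - 1, j, -1.0 - sigma * h / 2),   # west
                (i, j + 1, -1.0 + tau * h / 2),     # north
                (i, j - 1, -1.0 - tau * h / 2),     # south
            ]
            for ni, nj, c in stencil:
                if 0 <= ni < m and 0 <= nj < m:
                    row[nj * m + ni] = c
                else:
                    # Known boundary value moves to the right-hand side.
                    b -= c * g((ni + 1) * h, (nj + 1) * h)
            rows.append(row)
            rhs.append(b)
    return rows, rhs

rows, rhs = assemble(8, 3.0, 1.0)
```

The east and west coefficients differ by $\sigma h$ and the north and south coefficients by $\tau h$, which is the source of the nonsymmetry. Because the true solution is bilinear, the central differences are exact for it, so the discrete equations are satisfied by the sampled true solution up to rounding.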

Table 2 displays the numerical results obtained by the standard ORTHOMIN(k) and AR-ORTHOMIN(k) methods. For this problem, the AR-ORTHOMIN(5) and AR-ORTHOMIN(10) methods worked quite well. On the other hand, the standard ORTHOMIN(k) method, k = 5 or 10, required excessive computational time and an excessive number of iterations. Figure 4 gives representative plots of the convergence behavior of the ORTHOMIN(5), ORTHOMIN(10), AR-ORTHOMIN(5), and AR-ORTHOMIN(10) methods for the case of $h = 1/257$ and $(\sigma+\tau)h/4 = 5.0$. As can be seen clearly, only the AR-ORTHOMIN(k) method is successful in this example. The ORTHOMIN(k) method without the adaptive restarted procedure has some trouble from the beginning, which causes the stagnation. Note that in this case the AR-ORTHOMIN(5) method is preferable, because it is more efficient: the computational cost of the AR-ORTHOMIN(5) method is less than that of the AR-ORTHOMIN(10) method. This result shows that, over the course of the run, the AR-ORTHOMIN(k) method keeps the residual size better behaved than the standard ORTHOMIN(k) method, which has no adaptive restarted procedure. We found that in most cases the AR-ORTHOMIN(k) method was more efficient than the standard ORTHOMIN(k) method in CPU time.

Table 2. The numerical results for example 1 ($(\sigma+\tau)h/4 = 0.5$)

  $(\sigma : \tau)$          8:0      7:1      ...      ...      4:4

  Execution time (sec)
  ORTHOMIN(5)              64.01    63.98    55.20    47.76    45.37
  AR-ORTHOMIN(5)           53.23    54.75    51.53    46.51    47.46
  ORTHOMIN(10)            114.06   103.27    94.08    82.08    83.55
  AR-ORTHOMIN(10)          86.00    82.00    84.60    77.61    77.16

  Number of iterations
  ORTHOMIN(5)               1031     1030      891      772      732
  AR-ORTHOMIN(5)             838      852      798      737      745
  ORTHOMIN(10)              1258     1139     1039      906
  AR-ORTHOMIN(10)            964      915      943      865      858

  Number of restarts
  AR-ORTHOMIN(5)               5        4        3        3        2
  AR-ORTHOMIN(10)              5        3        3        3        2

Fig. 4. The convergence behavior of residual norms vs. computational time for example 1 ($(\sigma+\tau)h/4 = 5.0$, $\sigma : \tau = 8 : 0$)

Table 3. The numerical results for example 2

  $\sigma h$               $2^{-2}$  $2^{-1}$  $2^{0}$   $2^{1}$    $2^{2}$

  Execution time (sec)
  ORTHOMIN(5)              213.55   290.96   344.28      —         —
  AR-ORTHOMIN(5)           191.54   217.57   325.40      —         —
  ORTHOMIN(10)             227.48   309.03   431.88      —         —
  AR-ORTHOMIN(10)          243.68   276.74   482.00      —         —

  Number of iterations
  ORTHOMIN(5)                3212     4369     5185   (4e-12)*  (1e-8)*
  AR-ORTHOMIN(5)             2896     3338     4945   (2e-10)*  (2e-8)*
  ORTHOMIN(10)               2499     3400     4751   (3e-12)*  (2e-8)*
  AR-ORTHOMIN(10)            2723     3111     5447   (3e-11)*  (2e-8)*

  Number of restarts
  AR-ORTHOMIN(5)               18       63       58       43        35
  AR-ORTHOMIN(10)              15       24       53       52        28

  *The relative residual norm after the maximum iterations

[Example 2] We now consider a somewhat more difficult class of finite difference discretization of the Dirichlet boundary value problem as follows:

-  Wj/y  +  I  ^1/  -  0  0  -  0  Uyj 

$= f(x,y)$ on $\Omega = [0,1]^2$,
$u(x,y)|_{\partial\Omega} = 1 + xy$.

Central differencing, with uniform mesh spacing $h$ in each direction, yields an $n \times n$ sparse coefficient matrix. The right hand side of the above equation is taken such that the true solution is $u(x,y) = 1 + xy$. Problems of this type arise frequently in many scientific applications and are of significant practical importance. The initial approximation vector is $x_0 = 0$ and no preconditioning is used for these numerical experiments.

For the test problem we let $h = 1/257$ and use several values of $\sigma$. We give comparative results in Table 3 with $\sigma h = 2^{-2}, 2^{-1}, 2^{0}, 2^{1}, 2^{2}$, respectively. In the execution time entries of this table, runs that did not converge within the maximum number of iterations are labeled by ( — ).

In Table 3, in most cases the AR-ORTHOMIN(5) method worked quite well. For the cases $\sigma h = 2^{-2}$ and $2^{-1}$, the AR-ORTHOMIN(10) method required an excessive number of iterations and excessive computational time.

Figure 5 gives representative plots of the convergence behavior of the above mentioned methods with no preconditioning for the case $\sigma h = 2^{0}$.

The following observations on this problem can be made. The AR-ORTHOMIN(5) method worked well in most cases, particularly for $\sigma h = 2^{-2}$. As can be seen, for large $k$, as in the AR-ORTHOMIN(10) method, the improvement of the computational cost is not impressive, but the residual norms of the AR-ORTHOMIN(10) method stay well below those of the standard ORTHOMIN(10) method. We note that, as expected from these numerical experiments, the AR-ORTHOMIN(5) method is slightly more efficient than the AR-ORTHOMIN(10) method.

Fig. 5. The convergence behavior of residual norms vs. computational time for example 2 ($\sigma h = 2^{0}$)

[Example  3]  Our  last  example  is  taken  from  the  example  of  Reichel  et  al.  [6]  and 
Gutknecht  [7]. 


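$$A = \begin{pmatrix} 1 & 0.5 & & & \\ 0 & 1 & 0.5 & & \\ \sigma & 0 & 1 & 0.5 & \\ & \sigma & 0 & \ddots & \ddots \\ & & \sigma & 0 & 1 \end{pmatrix} \in \mathbb{R}^{4096 \times 4096}, \qquad (\sigma > 0)$$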


Since all the eigenvalues of $M = (A + A^T)/2$ are distributed in the interval $[-2\sigma, 2 + 2\sigma]$, the condition number of $M$ becomes large as the element $\sigma$ becomes large. Also, the positive definiteness of $M$ is not guaranteed. On the other hand, the spectral radius of $R = (A - A^T)/2$ satisfies the inequality $\rho(R) < 1 + 2\sigma$.

Table 4 shows the numerical results for several values of $\sigma$. In this example, since the residuals of the standard ORTHOMIN(k) method converged linearly in all cases, no restart was performed by the AR-ORTHOMIN(k) method.


Table 4. The numerical results for example 3

  $\sigma$                   ...      0.3      0.5      0.7      0.9

  Execution time (sec)
  ORTHOMIN(5)               0.45     0.45     0.83     1.69       —
  AR-ORTHOMIN(5)            0.46     0.46     0.84     1.72       —
  ORTHOMIN(10)              0.68     0.68     1.17     2.29     6.85
  AR-ORTHOMIN(10)           0.68     0.68     1.18     2.31     6.90

  Number of iterations
  ORTHOMIN(5)                 32       32       57      115   (4e-10)*
  AR-ORTHOMIN(5)              32       32       57      115   (4e-10)*
  ORTHOMIN(10)                32       32       52       98      285
  AR-ORTHOMIN(10)             32       32       52       98      285

  Number of restarts
  AR-ORTHOMIN(5)               0        0        0        0        0
  AR-ORTHOMIN(10)              0        0        0        0        0

  *The relative residual norm after the maximum iterations


5  Conclusion

Our study involved a new approach to the adaptive restarted procedure for the ORTHOMIN(k) algorithm. One interesting feature of this technique is the fact that no extra calculation is explicitly needed: the required quantities are given implicitly by the calculations of the standard ORTHOMIN(k) algorithm. The results presented in this paper suggest that the adaptive restarted procedure with the ORTHOMIN(k) algorithm, which we call AR-ORTHOMIN(k), can be one of the useful tools for computing the approximate solution of large and sparse nonsymmetric linear systems of equations on parallel machines with modern high performance architectures. The details of the parallel implementation of this strategy and further numerical experiments are given in Tsuno and Nodera [15].
Abstract. A driving factor in Digital System DS architecture is the feature size of the silicon implementation process. We present Moore's laws and focus on the shrink laws, which relate chip performance to feature size. The theory is backed with experimental measures from [14], relating performance to feature size, for various memory, processor and FPGA chips from the past decade. Conceptually shrinking back existing chips to a common feature size leads to common architectural measures, which we call normalized: area, clock frequency, memory and operations per cycle. We measure and compare the normalized compute density of various chips, architectures and silicon technologies.

A Reconfigurable System RS is a standard processor tightly coupled to a Programmable Active Memory PAM, through a high bandwidth digital link. The PAM is a FPGA and SRAM based coprocessor. Through software configuration, it may emulate any specific custom hardware, within size and speed limits. RS combine the flexibility of software programming with the performance level of application specific integrated circuits ASIC. We analyze the performance achieved by P1, a first generation RS [13]. It still holds some significant absolute speed records: RSA cryptography, applications from high-energy physics, and solving the Heat Equation. We observe how the software versions for these applications have gained performance, through better microprocessors. We compare with the performance gain which can be achieved, through implementation in P2, a second-generation RS [16].

Recent experimental systems, such as the Dynamically Programmable Arithmetic Array in [19] and others in [14], present advantages over current FPGA, both in storage and compute density. RS based on such chips are tailored for video processing, and similar compute, memory and IO bandwidth intensive applications. We characterize some of the architectural features that a RS must possess in order to be fit to shrink: automatically enjoy the optimal gain in performance through future shrinks. The key to scale, for any general purpose system, is to embed memory, computation and communication at a much deeper level than presently done.


1  Moore’s  Laws 

Our modern world relies on an ever increasing number of Digital Systems DS: from home to office, through car, boat, plane and elsewhere. As a case in point, the sheer economic magnitude of the Millennium Bug [21] shows how futile it would be to try and list all the functions which DS serve in our brave new digital world.


Fig. 1. Estimated number and world wide growth rate: G = $10^9$ transistors fabricated per year; G bit operations computed each second; Billion $ revenues from silicon sold world wide; $ cost per G = $2^{30}$ bits of storage.


Through recent decades, earth's combined raw compute power has more than doubled each year. Somehow, the market remains elastic enough to find applications, and people to pay, for having twice as many bits automatically switch state than twelve months ago. At least, many people did so, each year, for over thirty years - fig. 1.

An  ever  improving  silicon  manufacturing  technology  meets  this  ever  increas¬ 
ing  demand  for  computations:  more  transistors  per  unit  area,  bigger  and  faster 
chips.  On  the  average  over  30  years,  the  cost  per  bit  stored  in  memory  goes  down 
by  30%  each  year.  Despite  this  drop  in  price,  selling  80%  more  transistors  each 
year  increases  revenue  for  the  semi-conductor  industry  by  20%  -  fig.  1. 

The number of transistors per mm² grows about 40% each year, and chip size increases by 15%, so:

The  number  of  transistors  per  chip  doubles  in  about  18  months. 
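
The 18 month figure follows from compounding the two growth rates quoted above. The short Python check below reproduces the arithmetic; the variable names are illustrative only.

```python
import math

density_growth = 1.40   # transistors per mm^2: +40% per year
area_growth = 1.15      # chip size: +15% per year

# Transistors per chip grow by the product of both factors, about 1.61 per year.
per_chip_growth = density_growth * area_growth

# Time needed for that yearly factor to compound to a factor of 2, in months.
doubling_months = 12 * math.log(2) / math.log(per_chip_growth)
```

With these rates the doubling time comes out between 17 and 18 months, consistent with the rounded statement of the law.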

That  is  how  G.  Moore,  one  of  the  founders  of  Intel,  famously  stated  the  laws 
embodied  in  fig.  1.  That  was  in  the  late  sixties,  known  since  as  Moore’s  Laws. 

More  recently,  G.  Moore  [18]  points  out  that  we  will  soon  fabricate  more 
transistors  per  year  than  there  are  living  ants  on  earth:  an  estimated  10^^. 

.1,  computations,  not  transistors.  How  much  computation  do 

they buy? Operating all of this year's transistors at 60 MHz amounts to an aggregate compute power worth 10^^ bop/s - bit operations per second. That would be on the order of 10 million bop/s per ant!

This  estimate  of  the  world’s  compute  power  could  well  be  off  by  some  order 
of  magnitude.  What  matters  is  that  computing  power  at  large  has  more  than 
doubled  each  year  for  three  decades,  and  it  should  do  so  for  some  years  to  come. 

1.1  Shrink  Laws 


Fig. 2. Shrink of the feature size with time: minimum transistor width, in µm = $10^{-6}$ m. Growth of chip area - in mm².


The economic factors at work in fig. 1 are separated from their technological consequences in fig. 2. The feature size of silicon chips shrinks: over the past two decades, the average shrink rate was near 85% per year. During the same time, chip size has increased: at a yearly rate near 10% for DRAM, and 20% for processors.

The effect on performance of scaling down all dimensions and the voltage of a silicon structure by 1/2 is as follows: the area reduces to 1/4, the clock delay reduces to 1/2 and the power dissipated per operation reduces to 1/8.

Equivalently, the clock frequency doubles, the transistor density per unit area quadruples, and the number of operations per unit energy is multiplied by 8, see fig. 2. This shrink model was presented by [2] in 1980, and intended to cover feature sizes down to 0.3 µm - see fig. 3.
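
The linear shrink rules above can be written as a small function of the scale factor. The Python sketch below assumes the ideal model only (all dimensions and the voltage scaled by the same factor `s`); the function name `shrink` and the dictionary keys are hypothetical, and "energy per operation" is used here for what the text calls the power dissipated per operation.

```python
def shrink(s):
    """Ideal linear shrink by a factor s (0 < s <= 1) of all dimensions and voltage."""
    return {
        "area": s ** 2,                 # two dimensions shrink
        "clock_delay": s,
        "energy_per_op": s ** 3,
        "clock_frequency": 1 / s,
        "transistor_density": 1 / s ** 2,
        "ops_per_energy": 1 / s ** 3,
    }

half = shrink(0.5)
```

For `s = 0.5` this gives area 1/4, delay 1/2 and energy per operation 1/8, and equivalently factors 2, 4 and 8 for frequency, density and operations per unit energy.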

Fig. 4 compares the shrink model from fig. 3 with experimental data gathered in [14], for various DRAM chips, published in the last decade. The last entry - from [15] - accounts for synchronous SDRAM, where access latency is traded for throughput. Overall, we find a rather nice fit to the model. In fig. 7, we also find agreement between the theoretical fig. 3 and experimental data for microprocessors and FPGA, although some architectural trends appear.

Fig. 3. Performance, as the minimum transistor width (feature size) shrinks from 8 to 0.03 µm: transistors per square millimeter; fastest possible chip wide synchronous clock frequency, in gigahertz; number of operations computed

A recent update of the shrink model by Mead [9] covers features down to 0.03 µm. The optimist's conclusion, from [9]:

We  can  safely  count  on  at  least  one  more  order  of  magnitude  of 
scaling. 

The pessimist will observe that it takes 2 pages in [2] to state and justify the linear shrink rules; it takes 15 pages in [9], and the rules are no longer linear. Indeed, thin oxide is already nearly 20 atoms thick, at current feature size 0.2 µm. A linear shrink would have it be less than one atom thick, around 0.01 µm. Other fundamental limits (quantum mechanical effects, thermal noise, light's wavelength, ...) become dominant as well, near the same limit. Although C. Mead [9] does not explicitly cover finer sizes, the implicit conclusion is:

We  cannot  count  on  two  more  orders  of  magnitude  of  scaling. 

Moore's law will thus eventually either run out of fuel - demands for bop/s will some year be under twice that of the previous - or it will be out of an engine - shrink laws no longer apply below 0.01 µm. One likely possibility is some combination of both: feature size will shrink ever more slowly, from some future time on.

On  the  other  hand,  there  is  no  fundamental  reason  why  the  size  of  chips 
cannot  keep  on  increasing,  even  if  the  shrink  stops.  Likewise,  we  can  expect  new 
architecture  to  improve  the  currently  understood  technology  path.  No  matter 
what  happens,  how  to  best  use  the  available  silicon  will  long  remain  an  im¬ 
portant  question.  Another  good  bet:  the  amount  of  storage,  computation  and 
communication,  available  in  each  system  will  grow,  ever  larger. 


2  Performance  Measures  for  Digital  Systems 

Communication, processing and storage are the three building blocks of DS. They are intimately combined at all levels. At micron scale, wires, transistors and capacitors implement the required functions. At human scale, the combination of a modem, microprocessor and memory in a PC box does the trick. At planet scale, communication happens through more exotic media - waves in the electromagnetic ether, or optic fiber - at either end of which one finds more memory, and more processing units.

Fig. 4. Actual DRAM performance as feature size shrinks from 0.8 to 0.075 µm: clock frequency in megahertz; square millimeters per chip; bits per chip; power is expressed in bit per second per square micron.


2.1  Theoretical  performance  measures 

Shannon's Mathematical Theory of Communication [1] shows that physical measures of information (bits b) and communication (bits per second b/s) are related to the abstract mathematical measure of statistical entropy $H$, a positive real number $H > 0$. Shannon's theory does not account for the cost of any computation. Indeed, the global function of a communication or storage device is the identity $X = Y$.

On  the  other  hand,  source  coding  for  MPEG  video  is  among  the  most  de¬ 
manding  computational  tasks.  Similarly,  random  channel  coding  (and  decod¬ 
ing),  which  gets  near  the  optimal  for  the  communication  purposes  of  Shannon 
as  coding  blocks  become  bigger,  has  a  computational  complexity  which  increases 
exponentially  with  block  size. 

The basic question in Complexity Theory is to determine how many operations $C(f)$ are necessary and sufficient for computing a digital function $f$. All operations in the computation of $f$ are accounted for, down to the bit level, regardless of when, where, or how the operation is performed. The unit of measure for $C(f)$ is one Boolean operation bop. It is applicable to all forms of computations - sequential, parallel, general and special purpose. Some relevant results (see [5] for proofs):

1. The complexity of $n$ bit binary addition is $5n - 3$ bop. The complexity of computing one bit of sum is 1 add = 5 bop (full adder: 3 in, 2 out).

2. The complexity of $n$ bit binary multiplication can be reduced, from a quadratic number of bop for the naive method (and $4n^2$ bop through Booth Encoding), down to $c(\epsilon)n^{1+\epsilon}$, for any real number $\epsilon > 0$. As $c(\epsilon) \to \infty$ when $\epsilon \to 0$, the practical complexity of binary multiplication is only improved for $n$ large.

3. Most Boolean functions $f$, with $n$ bits of input and one output, have a bop complexity $C(f)$ such that $2^n/n < C(f) < (2^n/n)(2 + \epsilon)$, for all $\epsilon > 0$ and $n$ large enough. To build one, just choose at random! No explicitly described Boolean function has yet been proved to possess more than linear complexity (including multiplication). An efficient way to compute a random Boolean function is through a Lookup Table LUT, implemented with a RAM or a ROM.
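
The $5n - 3$ count of the first result can be checked by writing out a ripple-carry adder with two-input Boolean operations only. The Python sketch below is one possible accounting, assumed here for illustration: a full adder is taken as 2 XOR + 2 AND + 1 OR = 5 bop, and the least significant bit uses a half adder (1 XOR + 1 AND = 2 bop) since it has no carry in. The function name `add_bits` is hypothetical.

```python
def add_bits(a, b, n):
    """Add two n-bit integers bit by bit; return (sum, number of Boolean operations)."""
    # Bit 0: half adder, 2 bop, because there is no carry in.
    x, y = a & 1, b & 1
    result = x ^ y
    carry = x & y
    ops = 2
    for i in range(1, n):
        x, y = (a >> i) & 1, (b >> i) & 1
        t = x ^ y                       # 1 bop
        s = t ^ carry                   # 1 bop
        carry = (x & y) | (t & carry)   # 3 bop
        ops += 5
        result |= s << i
    result |= carry << n                # the final carry is bit n of the sum
    return result, ops

total, bops = add_bits(200, 100, 8)
```

For $n = 8$ the count is $2 + 5 \cdot 7 = 37 = 5n - 3$ bop.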

Computation is free in Shannon's model, while communication and memory are free within Complexity Theory. The Theory of VLSI Complexity aims at measuring, for all physical realizations of a digital function $f$, the combined complexity of communication, memory, and computation. The VLSI complexity of function $f$ is defined with respect to all possible chips for computing $f$. Implementations are all within the same silicon process, defined by some feature size, speed and design rules. Each design computes $f$ within some area $A$, clock frequency $F$ and $T$ clock periods per IO sample. The silicon area $A$ is used for storage, communication and computation, through transistors and wires. Optimal designs are selected, based on some performance measure. For our purposes: minimize the area $A$ for computing function $f$, subject to the real time requirement $F/T < F_{io}$. In theory, one has to optimize among all designs for computing $f$. In practice, the search is reduced to structural decompositions into well known standard components: adders, multipliers, shifters, memories, ...


2.2  Trading  size  for  speed 

VLSI design allows trading area for speed. Consider, for example, the family of adders: their function is to repeatedly compute the binary sum $S = A + B$ of two $n$ bit numbers $A$, $B$. Fig. 5 shows four adders, each with a different structure, performance, and mapping of the operands through time and IO ports. Let us analyze the VLSI performance of these adders, under simplifying assumptions: $a_{fa} = 2a_r$ for the area (based on transistor counts), and $d_{fa} = d_r$ for the combinatorial delays of fadd and reg (setup and hold delay).

1. Bit serial (base 2) adder sA2. The bits of the binary sum appear through the unique output port as a time sequence $s_0, s_1, \ldots, s_n, \ldots$ one bit per clock cycle, from least to most significant. It takes $T = n + 1$ cycles per sum $S$. The area is $A = 3a_r$: it is the smallest of all adders. The chip operates at clock frequencies up to $F = 1/2d_r$: the highest possible.

Fig. 5. Four serial adders: sA2 - base 2, sA4 - base 4, sAI4 - base 4 interleaved, and 2sA2 - two independent sA2. An oval represents the full adder fadd; a square denotes the register reg (one bit synchronous flip-flop; the clock is implicit in the schematics).

2. Serial two bits wide (base 4) adder sA4. The bits of the binary sum appear as two time sequences $s_0, s_2, \ldots, s_{2n}, \ldots$ and $s_1, s_3, \ldots$ two bits per cycle, through two output ports. Assuming $n$ to be odd, we have $T = (n + 1)/2$ cycles per sum. The area is $A = 5a_r$ and the operating frequency $F = 1/3d_r$.

3. Serial interleaved base 4 adder sAI4. The bits of the binary sum $S$ appear as two time sequences $s_0, *, s_2, *, \ldots, s_{2n}, *, \ldots$ and $*, s_1, *, s_3, \ldots$ one bit per clock cycle, even cycles through one output port, odd through the other. The alternate cycles (the $*$) are used to compute an independent sum $S'$, whose IO bits (and carries) are interleaved with those for sum $S$. Although it still takes $n + 1$ cycles in order to compute each sum $S$ and $S'$, we get both sums in so many cycles, at the rate of $T = (n + 1)/2$ cycles per sum. The area is $A = 6a_r$ and the maximum operating frequency $F = 1/2d_r$.

4. Two independent bit serial adders 2sA2. This circuit achieves the same performance as the previous: $T = (n + 1)/2$ cycles per sum, area $A = 6a_r$ and frequency $F = 1/2d_r$.


The transformation that unfolds the base 2 adder sA2 into the base 4 adder sA4 is a special instance of a general procedure. Consider a circuit $C$ which computes some function $f$ in $T$ cycles, within gate complexity $G$ bop and memory $M$ bits. The procedure from [11] unfolds $C$ into a circuit $C'$ for computing $f$: it trades cycles $T' = T/2$ for gates $G' = 2G$, at constant storage $M' = M$.

In the case of serial adders, the area relation is $A' = 5A/3 < 2A$, so that $A'T' < AT$. On the other hand, since $F' = 1/3d$ and $F = 1/2d$, we find that $A'T'/F' > AT/F$. An equivalent way to measure this is to consider the density of full adders fadd per unit area $a_{fa} = 2a_r$, for both designs $C$ and $C'$: as $2/A = 0.66 < 4/A' = 0.8$, the unfolded design has a better fadd density than the original. Yet, since $F = 1.5F'$, the compute density - in fadd per unit area and time $d_{fa} = d_r$ - is lower for circuit $C'$: $F/A = 0.16 > 2F'/A' = 0.13$. When we unfold from base 2 all the way to base $2^n$, the carry register may be simplified away: it is always 0. The fadd density of this $n$-bit wide carry propagate adder is 1 per unit area, which is optimal; yet, as the clock frequency is $F = 1/n$, the compute density is low: $1/n$.
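
The figures of merit quoted for the four serial adders can be tabulated with exact fractions. The Python sketch below is only a bookkeeping aid under the simplifying assumptions of this section (areas in units of $a_r$, delays in units of $d_r$, cycles per sum in units of $n + 1$); the full adder counts per design and the names `adders` and `merits` are assumptions introduced here, read off from the stated areas.

```python
from fractions import Fraction as Fr

# name: (area in a_r, cycles per sum in units of (n + 1), clock period in d_r, fadd count)
adders = {
    "sA2":  (3, Fr(1),    2, 1),
    "sA4":  (5, Fr(1, 2), 3, 2),
    "sAI4": (6, Fr(1, 2), 2, 2),
    "2sA2": (6, Fr(1, 2), 2, 2),
}

def merits(area, cycles, period, fadds):
    freq = Fr(1, period)
    return {
        "AT": area * cycles,                      # area-time product
        "AT/F": area * cycles / freq,             # area-time product in real time
        "fadd_density": Fr(2 * fadds, area),      # share of area used by fadd (a_fa = 2 a_r)
        "compute_density": fadds * freq / area,   # fadd per unit area and unit time
    }

table = {name: merits(*spec) for name, spec in adders.items()}
```

The table reproduces the comparison in the text: sA4 has the smaller $AT$ (5/2 against 3) but the larger $AT/F$ (15/2 against 6), a higher fadd density (4/5 against 2/3) and a lower compute density (2/15 against 1/6), while sAI4 and 2sA2 keep the densities of sA2.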

Circuits  s^/4  and  2sA2  present  two  ways  of  optimally  trading  time  for  area, 
at  constant  operator  and  compute  density.  Both  are  instances  of  general  meth¬ 
ods,  applicable  to  any  function  /,  besides  binary  addition.  From  any  circuit  C 
for  computing  /  within  area  A,  time  T  and  frequency  F,  we  can  derive  circuits 
C  which  optimally  trades  area  A'  =  2A  for  time  T'  =  T/2,  at  constant  clock 
frequency  F'  =  F.  The  trivial  unfolding  constructs  C'  =  2C  from  two  indepen¬ 
dent  copies  of  C,  which  operate  on  separate  10.  So  does  the  interleaved  adder 
sAM,  in  a  different  manner.  Generalizing  the  interleaved  unfolding  to  arbitrary 
functions does not always lead to an optimal circuit: the extra wiring required
may force the area above 2A, so that A' > 2A. Also note that while these optimal
unfoldings double the throughput (T' = n/2 cycles per addition), the latency of each
individual addition is not reduced from the original one (T = n cycles per addition).
We may constrain the unfolded circuit to produce the IO samples in the
standard order, by adding reformatting circuitry on each side of the IO: a buffer
of size n bits and a few gates for each input and output suffice. Once we account
for the extra area (for corner turning), we see that the unfolded circuit is no
longer optimal: A' > 2A. For a complex function where a large area is required,
the loss in corner turning area can be marginal. For simpler functions, it is not.
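
The trade described above can be checked numerically. The following is a minimal sketch, not taken from the paper: the class `Circuit`, the function `unfold` and the sample figures are hypothetical names and values chosen for illustration. It shows that the trivial unfolding keeps the AT product constant, while any extra corner turning area breaks optimality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Circuit:
    area: float    # A, in arbitrary area units
    cycles: float  # T, cycles per sample
    freq: float    # F, clock frequency

    @property
    def at_product(self) -> float:
        return self.area * self.cycles


def unfold(c: Circuit, overhead_area: float = 0.0) -> Circuit:
    """Two copies of c side by side: twice the area, half the cycles per sample.

    overhead_area models reformatting (corner turning) buffers; it is an
    illustrative parameter, not a figure from the text.
    """
    return Circuit(2 * c.area + overhead_area, c.cycles / 2, c.freq)


base = Circuit(area=10.0, cycles=32.0, freq=1.0)
ideal = unfold(base)                            # A' = 2A, T' = T/2
with_buffers = unfold(base, overhead_area=3.0)  # A' > 2A

print(base.at_product, ideal.at_product, with_buffers.at_product)
```

With these sample values the ideal unfolding leaves AT unchanged, and the buffered variant has a strictly larger AT product.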

In the case of addition, area may be optimally traded for time, for all integer
data bit widths D = n/T, as long as D < √n. Fast wide (D = n) parallel adders
have area A = n log(n), and are structured as binary trees. The area is dominated
by  the  wires  connecting  the  tree  nodes,  their  drivers  (the  longer  the  wire,  the 
bigger  the  driver),  and  by  pipelining  registers,  whose  function  is  to  reduce  all 
combinatorial  delays  in  the  circuit  below  the  clock  period  1/F  of  the  system. 

Transitive  functions  permute  their  inputs  in  a  rich  manner  (see  [4]) :  any  input 
bit  may  be  mapped  -  through  an  appropriate  choice  of  the  external  controls  - 
into any output bit position, among N possible per IO sample. It is shown in [4]
that computing a transitive function at IO rate D = NF/T requires an area A
such that:

$$A \ge a_m N + a_{io} D + a_w D^2 \qquad (1)$$

where a_m, a_io and a_w are proportional to the area per bit respectively required
for memory, IO and communication wires. Note that the gate complexity of a
transitive function is zero: input bit values are simply permuted on the output.
The above bound merely accounts for the area - IO ports, wires and registers -
which is required to acquire, transport and buffer the data at the required rate.
Bound (1) applies to shifters, and thus also to multipliers. Consider a multiplier
that computes 2n-bit products on each cycle, at frequency F. The wire area of
any such multiplier is proportional to n^2, as T = 1 in (1). For high bandwidth
multipliers,  the  area  required  for  wires  and  pipelining  registers  is  bigger  than 
that  for  arithmetic  operations. 
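
To see why the wire term dominates at high bandwidth, bound (1) can be evaluated directly. This is a rough sketch under stated assumptions: the function name `area_lower_bound` is hypothetical, the three per-bit coefficients (memory, IO, wires) are all set to 1, and the IO rate D is simply taken equal to the 2n product bits delivered per cycle. None of these values come from [4].

```python
def area_lower_bound(n_bits, io_rate, a_m=1.0, a_io=1.0, a_w=1.0):
    """Evaluate the three terms of bound (1): memory, IO and wire area."""
    memory = a_m * n_bits
    io = a_io * io_rate
    wires = a_w * io_rate ** 2
    return memory, io, wires


# A multiplier delivering one 2n-bit product per cycle (T = 1), with n = 64.
n = 64
memory, io, wires = area_lower_bound(n_bits=2 * n, io_rate=2 * n)
total = memory + io + wires
wire_share = wires / total

print(f"memory={memory:.0f} io={io:.0f} wires={wires:.0f} share={wire_share:.3f}")
```

Under these unit coefficients the quadratic wire term accounts for nearly all of the bound, which matches the remark that wire area grows as n^2.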

The bit serial multiplier (see [11]) has a minimal area A = n, high operating
frequency F, and it requires T = 2n cycles per product. A parallel naive multiplier
has area A' = n^2 and T' = 1 cycle per product. In order to maintain high
frequency F' = F, one has to introduce on the order of n^2 pipelining registers,
so (perhaps) A' = 2n^2 for the fully pipelined multiplier. These are two extreme
points in a range of optimal multipliers, according to bound (1), and within a
constant factor. Both are based on naive multiplication, and compute n^2 mul per
product. High frequency is achieved through deep pipelining, and the latency
per multiplication remains proportional to n. In theory, latency can be reduced
to T, by using reduced complexity n^{1+ε} shallow multipliers (see [3]); yet, shallow
multipliers have so far proved bigger than naive ones, for practical values such
as n < 256.
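
The claim that both multipliers sit on the same optimal tradeoff can be checked with the AT product. The sketch below uses the area and cycle counts quoted above (A = n, T = 2n for the bit serial design; A' = 2n^2, T' = 1 for the fully pipelined one); the function names are hypothetical and constant factors are ignored, as in the text.

```python
def at_bit_serial(n):
    """AT product of the bit serial multiplier: area n, 2n cycles per product."""
    return n * (2 * n)


def at_pipelined_parallel(n):
    """AT product of the fully pipelined parallel multiplier: area 2n^2, 1 cycle."""
    return (2 * n * n) * 1


for n in (8, 16, 64, 256):
    print(n, at_bit_serial(n), at_pipelined_parallel(n))
```

Both designs give AT = 2n^2, so moving from one to the other trades area for time at a constant AT product.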


2.3  Experimental  performance  measures 

Consider a VLSI design with area A and clock frequency F, which computes
function f in T cycles per N-bit sample. In theory, there is another design for f
which optimally trades area A' = 2A for cycles T' = T/2, at constant frequency
F' = F. Both the frequency F and the AT product remain invariant in such an
optimal tradeoff. Also invariant:

- The gate density (in bop/mm^2), given by D_op = c(f)/A = C(f)/AT. Here
c(f) is the bop complexity of f per cycle, while C(f) is the bop complexity
per sample.

- The compute density (in bop/s·mm^2) is c(f)F/A = F D_op.

Note that trading area for time at constant gate and compute density is equivalent
to keeping F and AT invariant.
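
The two invariants can be illustrated with a short calculation. This is a sketch only: the function names and the sample values (a function of 1000 bop per sample, on 10 mm^2, in 4 cycles, at 100 MHz) are assumptions made for the example, not data from the paper.

```python
def gate_density(bop_per_sample, area, cycles):
    """D_op = C(f) / (A * T), in bop per unit area."""
    return bop_per_sample / (area * cycles)


def compute_density(bop_per_sample, area, cycles, freq):
    """F * D_op, in bop per second per unit area."""
    return freq * gate_density(bop_per_sample, area, cycles)


C_f, A, T, F = 1000.0, 10.0, 4.0, 100e6

before = (gate_density(C_f, A, T), compute_density(C_f, A, T, F))
# Optimal tradeoff: A' = 2A, T' = T/2, F' = F.
after = (gate_density(C_f, 2 * A, T / 2), compute_density(C_f, 2 * A, T / 2, F))

print(before, after)
```

Since only the product AT enters the formulas, both densities are unchanged by the tradeoff.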

Let us examine how various architectures trade size for performance, in practice.
The data from [14] tabulates the area, frequency, and feature size, for
a representative collection of chips from the previous decade: SRAM, DRAM,
mPROC, FPGA, MUL.

The normalized area A/λ^2 provides a performance measure that is independent
of the specific feature size λ. It leads [14] to a quantitative assessment of
the gate density for the various chips, fig. 6 and 7.

Unlike [14], we also normalize clock frequency: its product by the operation
density is the normalized compute power. To define the normalized system
clock frequency φ, we follow [9] and use φ = 1/(100 τ(λ)), where τ(λ) is the minimal
inverter delay corresponding to feature size λ.

- The non linear formula used for τ(λ) = λ^e is taken from [9]: the exponent
e = 1 - ε(λ) decreases from 1 to 0.9 as λ shrinks from 0.3 to 0.03 μm. The non
linear effect is not yet apparent in the reported data. It will become more
significant with finer feature sizes, and clock frequency will cease to increase
some time before the shrink itself stops.

-  The  factor  100  leads  to  normalized  clock  frequencies  whose  average  value  is 
0.2  for  DRAM,  0.9  for  SRAM,  2  for  processors  and  2  for  FPGA. 
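
The two normalizations can be sketched as follows. The helper names, the unit conversions and the sample chip (100 mm^2 at 0.25 μm feature size, 200 MHz clock, 50 ps minimal inverter delay) are assumptions chosen for illustration. The inverter delay is passed in directly rather than derived from the formula of [9], and the normalized clock frequency is read here as the ratio of the clock frequency to φ.

```python
import math


def normalized_area(area_mm2, feature_um):
    """Area expressed in units of feature size squared."""
    area_um2 = area_mm2 * 1e6  # 1 mm^2 = 10^6 um^2
    return area_um2 / feature_um ** 2


def normalized_frequency(clock_hz, inverter_delay_s):
    """Clock frequency divided by phi = 1 / (100 * inverter delay)."""
    phi = 1.0 / (100.0 * inverter_delay_s)
    return clock_hz / phi


area_n = normalized_area(100.0, 0.25)
freq_n = normalized_frequency(200e6, 50e-12)
print(area_n, freq_n)
```

A chip whose clock period equals 100 minimal inverter delays has normalized clock frequency 1 in this reading, which is the order of magnitude quoted for SRAM and processors.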

In  the  absence  of  architectural  improvement,  the  normalized  gate  and  compute 
density  of  the  same  function  on  two  different  feature  size  silicon  implementations 
should be the same, and this indicates an optimal shrink.

Fig. 6. Performance of various SRAM and DRAM chips, within a common feature
size technology: normalized clock frequency Hz/φ; bit density per normalized area
1/λ^2; binary gate operations per normalized area per normalized clock period 1/φ.


-  The  normalized  performance  figures  for  SRAM  chips  in  fig.  6  are  all  within 
range:  from  one  half  to  twice  the  average  value. 

-  The  normalized  bit  density  for  DRAM  chips  in  the  data  set  is  4.5  times 
that  of  SRAM.  Observe  in  fig.  6  that  it  has  increased  over  the  past  decade, 
as the result of improvements in the architecture of the memory cell (trench
capacitors). The average normalized speed of DRAM is 4.5 times slower than
SRAM.  As  a  consequence  the  average  normalized  compute  density  of  SRAM 
equals  that  of  DRAM.  The  situation  is  different  with  SDRAM  (last  entry  in 
fig.  6):  with  the  storage  density  of  DRAM  and  nearly  the  speed  of  SRAM,  the 
normalized  compute  density  of  SDRAM  is  4  times  that  of  either:  a  genuine 
improvement  in  memory  architecture. 

A  Field  Programmable  Gate  Array  FPGA  is  a  mesh  made  of  programmable 
gates  and  interconnect  [17].  The  specific  function  -  Boolean  or  register  -  of  each 
gate  in  the  mesh,  and  the  interconnection  between  the  gates,  is  coded  in  some 
binary  bitstream,  specific  to  function  f,  which  must  first  be  downloaded  into 
the  configuration  memory  of  the  device.  At  the  end  of  configuration,  the  FPGA 
switches to user mode: it then computes function f, by operating just as any
regular  ASIC  would. 

The comparative normalized performance figures for various recent microprocessors
and FPGA are found in fig. 7.

Fig. 7. Performance of various microprocessor and FPGA chips from [14], within a
common feature size technology: normalized clock frequency; normalized bit
density; normalized gate and compute density, for Boolean operations, additions and
multiplications.

- Microprocessors in the survey appear to have maintained their normalized
compute density, by trading lower normalized operation density for a higher
normalized clock frequency, as feature size has shrunk. Only the microprocessors
with a built-in multiplier have kept the normalized compute density
constant. If we exclude multipliers, the normalized compute density of
microprocessors has actually decreased through the sample data.

- FPGA have stayed much closer to the model, and normalized performances
do not appear to have changed significantly over the survey (rightmost entry
excluded).

3  Reconfigurable  Systems 

A Reconfigurable System RS is a standard sequential processor (the host) tightly
coupled to a Programmable Active Memory PAM, through a high bandwidth link.
The PAM is a reconfigurable processor, based on FPGA and SRAM. Through
software configuration, the PAM emulates any specific custom hardware, within
size and speed limits. The host can write into, and read data from the PAM, as
with any memory. Unlike conventional RAM, the PAM processes data between
write and read cycles: it is an active memory. The specific processing is determined
by the contents of its configuration memory. The content of configuration memory
can be updated by the host, in a matter of milliseconds: it is programmable.

RS combine the flexibility of software programming with the performance level
of application specific integrated circuits ASIC. As a case in point, consider the
system P1 described in [13]. From the abstract of that paper:



We exhibit a dozen applications where PAM technology proves superior,
both in performance and cost, to every other existing technology,
including supercomputers, massively parallel machines, and conventional
custom hardware.

The fields covered include computer arithmetics, cryptography, error
correction, image analysis, stereo vision, video compression, sound synthesis,
neural networks, high-energy physics, thermodynamics, biology
and astronomy.

At  comparable  cost,  the  computing  power  virtually  available  in  a 
PAM  exceeds  that  of  conventional  processors  by  a  factor  10  to  1000, 
depending  on  the  specific  application,  in  1992. 

RS P1 is built from chips available in 92 - SRAM, FPGA and processor. Six long
technology years later, it still holds at least 4 significant absolute speed records.
In theory, it is a straightforward matter to port these applications to a state
of the art RS, and enjoy the performance gain from the shrink. In the practical
state of our CAD tools, porting the highly optimized P1 designs to other systems
would require time and skills. On the other hand, it is straightforward to estimate
the performance without doing the actual hardware implementation. We use the
Reconfigurable System P2 [16] - built in 97 - to conceptually implement the
same applications as P1, and compare. The P2 system has 1/4 the physical size
and chip count of P1. Both have roughly the same logical size (4k CLB), so the
applications can be transferred without any redesign. The clock frequency is 66
MHz on P2, and 25 MHz on P1 (33 MHz for RSA). So, the applications will
run at least twice as fast on P2 as on P1. Of course, if we compare equal size
and cost systems, we have to match P1 against 4P2, and the compute power has
been multiplied by at least 8. This is expected by the theory, as the feature size
of chips in P1 is twice that of chips in P2.
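
The estimate follows from the clock frequencies alone, since the designs are transferred without redesign. The small sketch below redoes that arithmetic; the helper name `speedup` is hypothetical, and the only inputs are the clock frequencies and the 4-to-1 size ratio quoted above.

```python
def speedup(new_mhz, old_mhz, boards=1):
    """Compute power ratio when the same design runs on `boards` faster systems."""
    return boards * new_mhz / old_mhz


general = speedup(66, 25)            # most applications: 25 MHz on the older system
rsa = speedup(66, 33)                # RSA ran at 33 MHz on the older system
equal_size = speedup(66, 33, boards=4)

print(general, rsa, equal_size)
```

The RSA design gives the lower bound: a factor 2 for one system, and a factor 8 at equal size and cost.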

What has been done [20] is to port and run on recent fast processors the
software version of some of the original P1 applications. That provides us with
a technology update on the respective compute power of RS and processors.

3.1  3D  Heat  Equation 

The fastest reported software for solving the Heat Equation on a supercomputer
is presented in [6]. It is based on the finite differences method. The Heat
Equation can be solved more efficiently on specific hardware structures [7]:

- Start from an initial state - at time tΔt - of the discrete temperatures in a
discrete 3D domain, all stored in RAM.

- Move to the next state - at time (t + 1)Δt - by traversing the RAM three
times, along the x, y and z axes.

-  On  each  traversal,  the  data  from  the  RAM  feeds  a  pipeline  of  averaging 
operators,  and  the  output  of  the  pipeline  is  stored  back  in  RAM. 

Each averaging operator computes the average value (a_t + a_{t+1})/2 of two consecutive
samples a_t and a_{t+1}. In order to be able to reduce the precision of internal
temperatures down to 16 bits, it is necessary, when the sum to be divided by two is odd, to
distribute that low-order bit randomly between the sample and its neighbor. All
deterministic round-off schemes lead to parasitic effects that can significantly
perturb the result. The pseudo-randomness is generated by a 64-bit linear feedback
shift-register LFSR. The resulting pipeline is shown in fig. 8. Instead of
being shifted, the least significant sum bit is either delayed or not, based on a
random choice in the LFSR.

Fig. 8. Schematics of a hardware pipeline for solving the Heat equation. It is drawn
with a pipeline depth of 4, and bit width of 4, plus 2 bits for randomized round off.
The actual P1 pipeline is 256 deep, and 16+2 wide. Pipelining registers, which allow the
network to operate at maximum clock frequency, are not indicated here. Neither is the
random bit generator.
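
The averaging step with randomized round-off can be mimicked in software. The following is a simplified sketch, not the hardware design: it uses a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) instead of the 64-bit one mentioned above, it rounds each odd sum up or down at random rather than passing the low-order bit to the neighbor, and the names `lfsr16` and `average_pass` are hypothetical.

```python
def lfsr16(seed=0xACE1):
    """Yield pseudo-random bits from a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11)."""
    state = seed
    while True:
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (bit << 15)
        yield bit


def average_pass(samples, rand_bits):
    """One averaging operator along a line: out[t] ~ (a[t] + a[t+1]) / 2."""
    out = []
    for a, b in zip(samples, samples[1:]):
        total = a + b
        half = total >> 1
        if total & 1:
            # Odd sum: the lost half unit is rounded up or down at random.
            half += next(rand_bits)
        out.append(half)
    return out


line = [3, 8, 8, 1, 6, 5]
smoothed = average_pass(line, lfsr16())
print(smoothed)
```

Each output is either the floor or the ceiling of the exact average, so no systematic bias in one direction accumulates over many passes.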


The P1 standing design can accurately simulate the evolution of temperature
over time in a 3D volume, mapped on 512^3 discrete points, with arbitrary power
source distributions on the boundaries. In order to reproduce that computation
in real time, it takes a 40,000 MIPS equivalent processing power: 40 G instructions
per second, on 32b data. This is out of the reach of microprocessors, at
least until 2001.


3.2  High  Energy  Physics 

The Transition Radiation Tracker TRT is one of two benchmarks proposed
by CERN [12]. The goal is to measure the performance of various computer architectures
in order to build the electronics required for the Large Hadron Collider
LHC, soon after the turn of the millennium. Both benchmarks are challenging
and well documented for a wide variety of processing technologies, including
some of the fastest current computers, DSP-based multiprocessors, systolic arrays,
massively parallel arrays, Reconfigurable Systems, and full custom ASIC
based solutions.

The TRT problem is to find straight lines (particle trajectories) in a noisy
digital black and white image. The rate of images is 100 kHz; the implied IO
rate close to 200 MB/s, and the low latency requirement (2 images) preclude
any implementation solution other than specialized hardware, as shown by [12].

The P1 implementation of the TRT is based on the Fast Hough Transform
[10], an algorithm whose hardware implementation trades computation for wiring
complexity. To reproduce the P1 performance reported in [12], a 64-bit sequential
processor needs to run at over 1.2 GHz. That is about the amount of computation
one gets, in 1998, with a dual processor, 64-bit machine, at 600 MHz.
The required external bandwidth (up to 300 MB/s) is what still keeps such an
application out of current microprocessor reach.


3.3 RSA cryptography


The P1 design for RSA cryptography combines a number of algorithmic techniques
[...] decryption rate in excess of [...]/s, although it uses only half the
logical resources available in P1.

The implementation takes advantage of hardware reconfiguration in many
ways: a rather different design is used for RSA encryption and decryption; a
different hardware modular multiplier is generated for each different prime modulus:
the coefficients of the binary representation of each modulus are hardwired
into the logical equations of the design. None of these techniques is readily applicable
to ASIC implementations, where the same chip must do both encryption
and decryption, for all keys.

As of printing time, this design still holds the acknowledged shortest time
per block of RSA, all digital species included. It is surprising that it has held
five years against other RSA hardware. According to [20], the record will go
to a (soon to be announced) Alpha processor (one 64b multiply per cycle, at
750MHz) running (a modified version of) the original software version in [8]. We
expect the record to be claimed back in the future by a P2 RSA design; yet,
the speedup for P1 was reported as 10x in 92, and we estimate that it should
be only 6x on 2P2, in 97. The reason: the fully pipelined multiplier, found
in recent processors, is fully utilized by RSA software. A normalized measure of
the impact of the multiplier on theoretical performance can be observed in fig. 7.

For the Heat Equation, the actual performance ratio between P1 and the
fastest processor (64b, 250MHz) was 100x in 92; with 4P2 against the 64b,
750MHz processor, the ratio should be over 200x in 98. Indeed, the computation
in fig. 8 combines 16b add and shift, with Boolean operations on three low order
bits: software is not efficient, and the multiplier is not used.


4  What  will  Digital  Systems  shrink  to? 

Consider a DS whose function and real time frequency remain fixed, once and
for all. Examples: digital watch, 56kb/s modem and GPS.

How does such a DS shrink with feature size?

To answer, start from the first chip (feature size 1) which computes function
f: area A, time T, and clock frequency F. Move in time, and shrink feature
size to 1/2. The design now has area A' = A/4, and the clock frequency doubles:
F' = 2F (F' = (2-ε)F with non-linear shrink). The number of cycles per sample
remains the same: T' = T. The new design has twice (or 2-ε times) the required real
time bandwidth: we can (in theory) further fold space in time, and produce a design
C'' for computing f within area A'' = A'/2 = A/8 and T'' = 2T cycles, still
at frequency F'' = F' = 2F. The size of any fixed real time DS shrinks very
fast with technology, indeed. At the end of that road, after so many hardware
shrinks, the DS gets implemented in software.
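
The shrink argument can be iterated. The sketch below assumes the ideal linear case (ε = 0) and uses a hypothetical function name and sample values: each step halves the feature size, then folds space in time once to return to the original sample rate.

```python
def shrink_fixed_rate(area, cycles, freq, steps):
    """Apply `steps` feature size halvings to a fixed real time design."""
    for _ in range(steps):
        # Shrink: area divided by 4, clock frequency doubled.
        area /= 4
        freq *= 2
        # Fold space in time: half the area, twice the cycles per sample.
        area /= 2
        cycles *= 2
    return area, cycles, freq


area, cycles, freq = shrink_fixed_rate(area=64.0, cycles=1.0, freq=1.0, steps=2)
print(area, cycles, freq, freq / cycles)
```

The area drops by a factor 8 per step while the sample rate F/T stays at its original value.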

On the other hand, microprocessors, memories and FPGA actually grow in
area, as feature size shrinks. So far, such commodity products have each aimed
at delivering ever more compute power, on one single chip. Indeed, if you look
inside some recent digital device, chances are that you will see mostly three
types of chips: RAM, processor and FPGA. While a specific DS shrinks with
feature size, a general purpose DS gains performance through the shrink, ideally
at constant normalized density.

4.1 System on a chip

There are compelling reasons for wanting a Digital System to fit on a single chip.
Cost per system is one. Performance is another:

- Off-chip communication is expensive, in area, latency and power. The bandwidth
available across some on-chip boundary is orders of magnitude greater than that
across the corresponding off-chip boundary.

- If one quadruples the area of a square, the perimeter just doubles. As a
consequence, when feature size shrinks by 1/x, the internal communication
bandwidth grows faster than the external IO bandwidth.
This is true as long as silicon technology remains planar: transistors within
a chip, and chips within a printed circuit board, must all be laid out side by
side (not on top of each other).

4.2  Ready  to  Shrink  Architecture 

So far, normalized performance density has been maintained, through the successive
generations of chip architecture.

Can this be sustained in future shrinks?

A dominant consideration is to keep up the system clock frequency F. The
formula for the normalized clock frequency, 1/φ = 100 τ(λ), implies that each
combinatorial sub-circuit within the chip must have delay less than 100x that of
a minimal size inverter. The depth of combinatorial gates that may be traversed
along any path between two registers is limited. The length of combinatorial
paths is limited by wire delays. It follows that only finitely many combinatorial
structures can operate at normalized clock frequency φ. There is a limit to the
number N of IO bits of any combinatorial structure which can operate at such a
high frequency. In particular, this applies to combinatorial adders (say N < 256),
multipliers (say N < 64) and memories.

4.3  Reconfigurable  Memory 

The use of fast SRAM with small block size is common in microprocessors:
for registers, data and instruction caches. Large and fast current memories are
made of many small monolithic blocks. A recent SDRAM is described in [15]:
1Gb stored as 32 combinatorial blocks of 32Mb each. A 1.6 GB/s bandwidth is
obtained: data is 64b wide at 200MHz.

By the argument from the preceding section, a large N bit memory must
be broken into N/B combinatorial blocks of size B, in order to operate at normalized
clock frequency F = φ. An N bit memory with minimum latency may
be constructed, through recursive decomposition into 4 quad memories, each of
size N/4 and laid out within one quarter of the chip. The decomposition stops
for N = B, when a block of combinatorial RAM is used. The access latency is
proportional to the depth log(N/B) of the hierarchical decomposition.
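
The recursive decomposition can be written down in a few lines. This sketch assumes N/B is a power of 4 so that the recursion ends exactly at N = B; the function name and the sample sizes are hypothetical.

```python
def quad_decompose(n_bits, block_bits):
    """Return (depth, leaf_count) of the recursive split into 4 quad memories."""
    if n_bits <= block_bits:
        return 0, 1  # a single combinatorial RAM block
    depth, leaves = quad_decompose(n_bits // 4, block_bits)
    return depth + 1, 4 * leaves


N, B = 2 ** 20, 2 ** 12
depth, leaves = quad_decompose(N, B)
print(depth, leaves)
```

For these values the hierarchy has depth 4, which is the base 4 logarithm of N/B, and it contains N/B = 256 blocks.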

A Reconfigurable Memory RM is an array of high speed dense combinatorial
memory blocks. The blocks are connected through a reconfigurable pipelined
wiring structure. As with FPGA, the RM has a configuration mode, during
which the configuration part of the RM is loaded. In user mode, the RM is some
group of memories, whose specific interconnect and block decomposition is coded
by the configuration. One can trade data width for address depth, from 1 x N
to N/B x B in the extreme cases.

A natural way to design a RM is to embed blocks of SRAM within a FPGA
structure.  In  CHESS  [19],  the  atomic  SRAM  block  has  size  8  x  256.  The  SRAM 
blocks  form  a  regular  pitch  matrix  within  the  logic,  and  it  occupies  about  30% 
of  the  area.  As  a  consequence,  the  storage  density  of  CHESS  is  over  1/3  that 
of  a  monolithic  SRAM.  This  is  comparable  to  the  storage  density  of  current 
microprocessors;  it  is  much  higher  than  the  storage  density  of  FPGA,  which  rely 
(so  far)  on  off-chip  memories. 

After configuration, the FPGA is a large array of small SRAM: each is used
as LUT - typically LUT4. Yet, most of the configuration memory itself is not accessible
as a computational resource by the application. In most current FPGA,
the process of downloading the configuration is serial, and it writes the entire
configuration memory. In a 0.5x shrink, the download time doubles: 4x bits at
(2-ε)x the frequency. As a consequence, the download takes about 20 ms on P1,
and 40 ms on P2.
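
The scaling of the serial download time can be checked with one line of arithmetic. The function below is a hypothetical helper: it multiplies the number of configuration bits by 4 and the serial clock by (2 - ε), as described above, with ε left as a free parameter.

```python
def download_time_after_shrink(time_ms, epsilon=0.0):
    """Serial download time after a 0.5x shrink: 4x the bits at (2 - epsilon)x the clock."""
    return time_ms * 4 / (2 - epsilon)


ideal = download_time_after_shrink(20.0)
non_linear = download_time_after_shrink(20.0, epsilon=0.1)
print(ideal, non_linear)
```

With ε = 0 the 20 ms download becomes 40 ms; any positive ε makes it slightly longer.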

A more efficient alternative is found in the X6k [17] and CHESS: in configuration
mode, configuration memory is viewed as a single SRAM by the host
system. This allows for faster complete download. An important feature is the
ability to randomly access the elements of the configuration memory. For the
RSA design, this allows for very fast partial reconfigurations: as we change the
value of the 512b key which is hardwired into the logical equations, only a few of
the configuration bits have to be updated. Configuration memory can also be used
as a general-purpose communication channel between the host and the application.

4.4  Reconfigurable  Arithmetic  Array 

The normalized gate density of current FPGA is over 10x that of processors,
both  for  Boolean  operations  and  additions  -  fig.  7.  This  is  no  longer  true  for 
the  multiply  density,  where  common  FPGA  barely  meets  the  multiply  density 
of  processors  which  recently  integrate  one  (or  more)  pipelined  floating  point 
multiplier. 

The arithmetical density of RS can be raised: MATRIX [DeHon], which is
an array of 8b ALU with Reconfigurable Interconnect, does better than FPGA.
CHESS is based on 4b ALU, which are packed as the white squares in a chessboard.
It follows that CHESS has an arithmetic density which is near 1/3 that of
custom multipliers. The synchronous registers in CHESS are 4b wide, and they
are found both within ALU and routing network, so as to facilitate high speed
systematic pipelining.

Another feature of CHESS [19] is that each black square in the chessboard
may be used either as a switchbox, or as a memory, based on a local configuration
bit. As a switchbox, it operates on 4b nibbles, which are all routed together. In
memory mode, it may implement various specialized memories, such as a depth
8 shift register, in place of eight 4b wide synchronous registers. In memory mode,
it can also be used as a 4b in, 4b out LUT. This feature provides CHESS with
a LUT4 density which is as high as for any FPGA.


4.5  Hardware  or  Software? 

In order to implement digital function Y = f(X), start from a specification by a
program in some high level language. Some work is usually required to have the
code match the digital specification, bit per bit: high level languages provide
little support for funny bit formats and operations beneath the word size.

Once done, compile and unwind this code so as to obtain the run-code C_f. It
is the sequence of machine instructions which a sequential processor executes
in order to compute output sample Y_t from input sample X_t. This computation
is to be repeated indefinitely, for consecutive samples: t = 0, 1, ... For the sake of
simplicity, assume the run-code to be straight-line: each instruction is executed
once in sequence, regardless of individual data values; there is no conditional
branch. In theory, the run-code should be one of minimal length, among all possible
for function f, within some given instruction set. Operations are performed
in sequence through the Arithmetic and Logic Unit ALU of the processor. Internal
memory is used to feed the ALU, and provide (memory-mapped) external
IO. For W the data width of the processor, the complexity of so computing f
is W|C_f| bop per sample. It is no less than the gate complexity G(f). Equality
|C_f| = G(f)/W only happens in ideal cases. In practice, the ratio between the
two can be kept close to one, at least for straight-line code.

The execution of run-code C_f on a processor chip at frequency F computes
function f at the rate of F/C samples per second, with C = |C_f|. The feasibility
of a software implementation of the DS on that processor depends on the real
time requirement F_io - in samples per second.

1. If F/C > F_io, the DS can be implemented on the sequential processor at
hand, through straightforward software.

2. If F/C < F_io, one needs a more parallel implementation of the digital system.

In case 1, the full computing power - WF in bop/s - of the processor is only used
when F/C = F_io. When that is not the case, say F/C > 2F_io, one can attempt
to trade time for area, by reducing the data width to W/2, while increasing
the code length to 2C: each operation on W bits is replaced by two operations
on W/2 bits, performed in sequence. The invariant is the product CW, which
gives the complexity of f in bop per sample. One can thus find the smallest
processor on which some sequential code for f can be executed within the real
time specification. The end of that road is reached for W = 1: a single bit wide
sequential processor, whose run-code has length proportional to G(f).
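
The decision between the two cases, and the search for the narrowest processor in case 1, can be sketched as follows. The function names and the sample figures (a 100 MHz, 32b processor, a run-code of 1000 instructions, a 20 kHz sample rate) are assumptions made for the example. The sketch also assumes that halving the data width exactly doubles the code length, as in the idealized argument above.

```python
def fits_in_software(freq, code_len, f_io):
    """Case 1 test: the processor delivers at least f_io samples per second."""
    return freq / code_len >= f_io


def narrowest_width(freq, code_len, width, f_io):
    """Halve W (doubling C) for as long as the real time requirement is still met."""
    while width > 1 and freq / (2 * code_len) >= f_io:
        width //= 2
        code_len *= 2
    return width, code_len


F, C, W, F_io = 100e6, 1000, 32, 20e3

feasible = fits_in_software(F, C, F_io)
w_min, c_min = narrowest_width(F, C, W, F_io)
print(feasible, w_min, c_min, w_min * c_min)
```

Here the 32b code can be folded twice, down to an 8b processor running 4000 instructions per sample, with the product CW unchanged.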

In case 2, and when one is not far away from meeting the real time requirement
- say F_io < 8F/C - it is advised to check if code C could be further reduced,
or moved to a wider and faster processor (either existing or soon to come when
the feature size shrinks again). Failing that software solution, one has to find a
hardware one. A common case mandating a hardware implementation is when
F ≈ F_io: the real time external IO frequency F_io is near the internal clock
frequency F of the chip.

4.6  Dynamic  Reconfiguration 

We have seen how to fold time in space: from a small design into a larger one,
with more performance. The inverse operation, which folds space in time, is not
always possible: how to fold any bit serial circuit (such as the adder from fig. 5)
into a half-size and half-rate structure is not obvious. Known solutions involve
dynamic reconfiguration.

Suppose that function f may be computed on some RS of size 2A, at twice
the real-time frequency F = 2F_io. We need to compute f on a RS of size A at
frequency F_io per sample. One technique, which is commonly used in [13], works
when Y = f(X) = g(h(X)), and both g and h fit within size A.

1.  Change  the  RS  configuration  to  design  h. 

2.  Process  N  input  samples  X;  store  each  output  sample  Z  =  h{X)  in  an 
external  buffer. 

3.  Change  the  RS  configuration  to  design  g. 

4.  Process  the  N  samples  Z  from  the  buffer,  and  produce  the  final  output 
y  =  9{Z)- 

5.  Go  to  1,  and  process  the  next  batch  of  N  samples. 

Reconfiguration  takes  time  R/F,  and  the  time  to  process  N  samples  is  2(Ar  -(- 
R)/F  =  {N  +  R)/Fio.  The  frequency  per  sample  Fio/(l  +  R/N)  gets  close  to 
real-time  Fio,  as  N  gets  large.  Buffer  size  and  latency  are  also  proportional  to  N, 
and  this  form  of  dynamic  reconfiguration  may  only  happen  at  a  low  frequency. 
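
As a rough numerical illustration of the batch scheme above, the following Python sketch evaluates the effective sample rate Fio/(1 + R/N) for a few batch sizes. The function name and the sample values chosen for Fio, N and R are illustrative assumptions, not figures from the text.

```python
def effective_rate(f_io: float, n: int, r: int) -> float:
    """Samples per second when designs h and g alternate on a half-size RS.

    Assumes the clock runs at F = 2 * f_io, that each of the two
    reconfigurations per batch costs r clock cycles, and that a batch of
    n samples needs two passes (first h, then g).
    """
    f = 2.0 * f_io
    batch_time = 2.0 * (n + r) / f   # equals (n + r) / f_io
    return n / batch_time            # equals f_io / (1 + r / n)


# Hypothetical numbers: 1 MHz real-time rate, 1000 cycles per reconfiguration.
for n in (100, 1_000, 10_000, 100_000):
    print(n, effective_rate(1.0e6, n, 1_000))
```

The rate approaches Fio only when N is much larger than R, which is the same regime in which buffer size and latency grow.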

The opposite situation is found in the ALU of a sequential processor; the
operation may change on every cycle. The same holds in dynamically programmable
systems, such as arrays of processors and DPGA [14]. With such a system, one
can reduce by half the number of processors for computing f, by having each
execute twice as much code. Note that this is a more efficient way to fold space
in time than previously; no external memory is required, and the latency is not
significantly affected.

The ALU in CHESS is also dynamically programmable. Although no specialized
memory is provided for storing instructions (unlike DPGA), it is possible
to build specialized dynamically programmed sequential processors, within the
otherwise statically configured CHESS array. Through this feature, one can
modulate the amount of parallelism in the implementation of a function f, in the
range between serial hardware and sequential software, which is not accessible
without dynamic reconfiguration.

5  Conclusion 

We  expect  it  to  be  possible  to  build  Reconfigurable  Systems  of  arbitrary  size, 
which  are  fit  to  shrink:  they  can  exploit  all  the  available  silicon,  with  a  high 




FEUP  -  Faculdade  de  Engenharia  da  Universidade  do  Porto 


““j  :Lrf::s;n“ 

whil"  ul  RS  p™  “  S  Xeep-up  with  the  eppply 

hardwL'teSeTat'X‘pace‘«7by 

Let  us  take  the  conclusion  from  Carver  Mead  [9]; 

havrtveloped  Te'pa^ad'gms  to  us^^'"  ^^^timeter  of  silicon  than  we 


References 

‘  sS  Of  .It?;":  IT:  <»  u.,v„- 

3.  F.P  PreIata°andVIlT'°”  a*  systems,  Addison  Wesley,  1980. 
(Springe, -Vetlag),  Hnifa,  iT  Jul  I, 

IVansactions  in  ComlT&lTMo'oTsS."'”’ 

o:  O.TMlTToTLaT'"!,"'''"”' 

Sliiben.C-A.  Wie  and  U  ioTteT'  Solchenbach,  K. 

a  survey  of  recent  developments”  Imea  f  methods  on  parallel  computers— 

vol.  3(1),  pp.  1-75,  ITm' ITS  «»3  finpineerinp, 

Villars,  RA.io,?7  SSl  S,  "®  “ 

VI.S,  Sign'al  P,“L?ng,  V  8,  T 

1  Applkatmn-SpecScAtrPrfSL^^^^^^  Conference  on 

1994  IEEE  TVans.’on  Computers,  43:8:868-79. 

^^Appl^aHons  yDTckRLeTpV9^mLbl"'t^^^  High-Energy  Physics 

Memon’e!;  ffte  ^  Boucard  Programmable 

1996.  ^  I  IEEE  Trans,  on  VLSI,  Vol.  4,  NO.  1,  pp.  56-69, 

isTISfT  •''■ 

emrckie,r i'ylrc-stped  TmTpT'Iy 'I 

donrn^  0,S.lid-s...,IcTTfS:lIT ISIS  ^ 




VECPAR’98  -  3rd  International  Meeting  on  Vector  and  Parallel  Processing 


16.  M.  Shand,  Pamette,  a  Reconfigurable  System  for  the  PCI  Bus,  1998. 
http://www.research.digital.com/SRC/pamette/ 

17.  Xilinx,  Inc.,  The  Programmable  Gate  Array  Data  Book,  Xilinx,  2100  Logic  Drive, 
San  Jose,  CA  95124  USA,  1998. 

18. G. Moore. An Update on Moore’s Law, 1998.
http://www.intel.com/pressroom/archive/speeches/gem93097.htm

19. Alan Marshall, Tony Stansfield, Jean Vuillemin, CHESS: a Dynamically
Programmable Arithmetic Array for Multimedia Processing, Hewlett Packard
Laboratories, Bristol, 1998.

20.  M.  Shand.  An  Update  on  RSA  software  performance,  private  communication,  1998. 

21.  The  millenium  bug:  how  much  did  it  really  cost?,  your  newspaper,  2001. 




A  Method  Based  on  Orthogonal  Transformation  for 
the  Design  of  Optimal  Feedforward  Network 
Architecture 


Bachiller  P.,  Perez  R.M.,  Martinez  P.,  Aguilar  P.L.,  Calle  J.E. 

Department of Computer Sciences. University of Extremadura. Escuela Politecnica. 10071
Cáceres. Spain.

{Pilarb, Rosapere, Pablomar, Paguilar}@unex.es


Abstract.  The  problem  of  determining  the  optimum  size  of  a  feedforward 
neural  network  is  recognized  to  be  crucial  for  its  practical  implications  in  such 
important  issues  as  learning  and  generalization.  Several  approaches  for 
designing  optimum  size  networks  have  been  proposed  in  the  literature,  which 
consist  of  training  a  larger  than  necessary  network  and  then  removing  the 
unnecessary links and nodes. In approaches of this kind, commonly known as
pruning, the network must be trained before the optimum number of links and
nodes can be computed and, once they have been identified, the reduced-size
network has to be retrained. In this paper, a direct method to obtain an
optimum size network during its training process is presented. We use
orthogonal transformations for computing the optimum number of nodes on
each iteration of the training process. These transformations lead to a
decorrelation of the information, which is the key to network size reduction.


1  Introduction 

The back-propagation (BP) algorithm has emerged as one of the most popular for the
supervised training of neural networks. This algorithm is extremely demanding in
computation and storage. An enormous amount of computation has to be spent on training
the network and, in the retrieving phase, high throughput is required for real-time
processing, which hinges on massively parallel processing capability.

Multiprocessors, arrays of processors and massively parallel processors provide a
natural solution to the BP algorithm, which can be expressed in basic matrix
operations, such as inner products, outer products and matrix multiplications. For
instance, operations of this kind can be mapped to basic processor arrays, such as
systolic or wavefront arrays. They have the following key advantages:

•  The  exploitation  of  pipelining  is  very  natural  in  regular  and  locally  connected 
networks.  They  yield  high  throughput  and  simultaneously  save  the  cost  associated 
with  communication. 

•  They provide a good balance between computation and communication, which is
critical to the effectiveness of array computing.

An open question related to neural networks is how to determine the most
appropriate network size for solving a specific task. To be representative, the
network should have an optimum number of links and nodes. Moreover, from an
implementation standpoint, small networks only require limited resources in any
physical computational environment. The network will be overparametrized if the
number of links is very high. In such cases, if the training set of data is not noise-free,
the NN will try to learn the information along with the noise in the data, leading to
poor validation results.

There are several approaches to solve the problem of determining the optimum
size of a neural network. The first approach, called growing algorithm, gradually adds
hidden units to an initially small network until it converges [1]-[4]. The
second one, known as pruning, consists of training a larger than necessary network;
the unnecessary nodes are then eliminated and finally the reduced-size network has to be
retrained [5] [6].

Pratim Kanjilal and Narayan Banerjee [5] have proposed an approach for the
optimization of the size of feedforward neural networks using orthogonal
transformations. They used two orthogonal transformations, the singular value
decomposition (SVD) [7] and the QR with column pivoting factorization (QRcp) [7].
Using SVD, the rank of a matrix can be computed and so the optimum number of
parameters is determined. QRcp coupled with SVD is used for subset selection, which
is the key to the design of optimal networks.

The  use  of  the  above  orthogonal  transformations  for  the  NN  size  optimization 
depends  on  which  nodes  (input  or  hidden  nodes)  are  going  to  be  optimized: 

1- Optimum number of input nodes: Let the PxN matrix A comprise the input data sets,
where P is the number of sets of data points (training patterns), and N is the
number of inputs. The aim is to determine which of the N features are relatively
redundant and, hence, can be eliminated. Performing SVD on A, the optimum
number of input nodes of the neural network (say L) is determined for the input
data sets. QRcp provides L of the N features, for the P sets of data points, which
are enough for a correct training process.

2- Optimum number of hidden links and nodes: Consider a network which has been
trained with P input data sets. A PxM matrix B is formed with the M pseudo
outputs of the concerned hidden layer for each of the P input data sets. SVD is
performed on B to determine a sufficient number of hidden nodes for the given
network. In case of a non-homogeneous network, i.e. when hidden nodes are fed
with different sets of inputs, QRcp transformation is performed on B and the
specific links between the hidden layers to be retained are identified. Once
the unnecessary nodes have been eliminated, the reduced-size network is retrained.

Castellano et al. [6] have developed a pruning algorithm based on the idea of
iteratively removing hidden units of a large trained network and then adjusting the
remaining weights in order to maintain the original input-output behavior.

In  the  above  approaches,  it  is  necessary  to  train  the  network  before  computing  the 
optimum  number  of  hidden  nodes  and,  once  they  have  been  identified,  remaining 
weights  have  to  be  adjusted.  In  this  paper,  we  propose  a  method  to  obtain  a  network 
with the optimum number of hidden nodes at the same time as it is trained. It is based on
computing the optimum number of hidden units on each iteration of the training
process, and then updating only the weights connected to those hidden units.


2  Orthogonal  Transformations 

In  order  to  compute  the  optimum  size  of  a  feedforward  neural  network,  we  apply 
orthogonal  transformations.  An  important  property  of  them  is  that  the  vector  2-norm, 
as  well  as  the  matrix  2-norm,  and  the  Frobenius  norm,  are  invariant  under  the 
application  of  this  kind  of  transformations. 

In  particular,  we  use  the  properties  of  Householder  reflections  for  computing  the 
optimum  number  of  hidden  nodes  on  each  iteration  of  the  training  process  of  the 
network.  These  transformations  are  described  as  follows  [7]: 

Let $v \in \mathbb{R}^n$ be nonzero. An $n \times n$ matrix $P$ of the form

$P = I - \frac{2vv^T}{v^T v}$    (1)

is called a Householder reflection. The vector $v$ is called a Householder vector.

It can be shown easily that matrix $P$ is symmetric and orthogonal:

•  Symmetric:

$P^T = I - \frac{2(vv^T)^T}{v^T v} = I - \frac{2vv^T}{v^T v} = P$    (2)

•  Orthogonal:

$P^T P = I + \frac{4vv^T vv^T}{(v^T v)^2} - \frac{4vv^T}{v^T v} = I$    (3)


Householder reflections can be used to zero selected components of a vector.
Given a vector $0 \neq x \in \mathbb{R}^n$, if we want $Px$ to be a multiple of $e_1$ (the first column of the
$n \times n$ identity matrix), then $v$ must be defined as follows:

$Px = \left(I - \frac{2vv^T}{v^T v}\right)x = x - \frac{2v^T x}{v^T v}\,v$    (4)

Setting $v = x + \alpha e_1$ gives

$v^T x = x^T x + \alpha x_1$    (5)

$v^T v = x^T x + 2\alpha x_1 + \alpha^2$    (6)

If we assume $\alpha = \pm\|x\|_2$ (the 2-norm of the vector $x$),

$v = x \pm \|x\|_2 e_1 \Rightarrow Px = \left(I - \frac{2vv^T}{v^T v}\right)x = \mp\|x\|_2 e_1$    (7)
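
The construction in equations (4)-(7) can be checked numerically. The sketch below is a minimal pure-Python illustration, not the authors' implementation: the function names are hypothetical, and the sign of $\alpha$ is chosen to match the sign of $x_1$, a common convention for avoiding cancellation that the text does not prescribe.

```python
import math


def householder_vector(x):
    """Return v = x + alpha * e1 with alpha = sign(x[0]) * ||x||_2.

    x must be nonzero. Matching the sign of alpha to x[0] is an
    assumption made here to avoid cancellation in v[0].
    """
    norm = math.sqrt(sum(c * c for c in x))
    alpha = norm if x[0] >= 0 else -norm
    v = list(x)
    v[0] += alpha
    return v


def reflect(v, x):
    """Apply P = I - 2 v v^T / (v^T v) to x without forming P (equation (4))."""
    vtv = sum(c * c for c in v)
    scale = 2.0 * sum(a * b for a, b in zip(v, x)) / vtv
    return [xi - scale * vi for xi, vi in zip(x, v)]


x = [3.0, 4.0, 0.0, 12.0]      # ||x||_2 = 13
v = householder_vector(x)
px = reflect(v, x)             # a multiple of e1 with magnitude ||x||_2
print(px)
```

Because the reflection is applied through the vector $v$ alone, the cost is a few inner products rather than a full matrix-vector multiplication.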

Given m vectors $x_1, x_2, \ldots, x_m \in \mathbb{R}^n$, Householder reflections are used to determine
which of them are linearly independent. Firstly, a Householder matrix $H_1$ to zero the
last n-1 components of $x_1$ is calculated. Next, the vector $y = H_1 x_2$ is obtained. If
$\|y'\|_2 = \|x_2\|_2$, then $x_2$ is linearly dependent on $x_1$; otherwise, another Householder matrix
($H_2$) has to be computed to zero the last n-2 components of $y$, and matrix $H_1$ must be
updated with the product $H_2 H_1$. The vector $y'$ is formed by the first $l$ components of
vector $y$, where $l$ represents the number of linearly independent vectors obtained on
each step. Now the vector $y$ is obtained by the product of $H_1$ and $x_3$, and the equality
$\|y'\|_2 = \|x_3\|_2$ is checked to determine the degree of correlation between $x_3$ and the previous vectors.
The remaining vectors are used in the same way to check the linear dependencies
between all of them.

For instance, assume that we have three vectors $x_1$, $x_2$ and $x_3$ in $\mathbb{R}^n$, and that $x_3$ is a
linear combination of $x_1$ and $x_2$. In such case, the linear dependency between those
vectors can be observed applying Householder reflections. The first step is to
compute the Householder matrix ($H_1$) that transforms $x_1$ into a multiple of $e_1$. Next,
the vector $y = H_1 x_2$ is computed. It can be observed that the equality $\|y'\|_2 = \|x_2\|_2$, where
$y'$ is the first component of $y$, does not hold because $x_1$ and $x_2$ are linearly independent. A
new Householder reflection $H_2$ is computed in order to zero the last n-2 components of
vector $y$. To check the linear dependency of $x_1$, $x_2$ and $x_3$, a new vector $y_3$ has to be
obtained by the product $H_2 H_1 x_3$. As $x_3$ is a linear combination of $x_1$ and $x_2$, it can be
expressed as $a x_1 + b x_2$ (with $a, b \in \mathbb{R}$), so the product $H_2 H_1 x_3$ can be obtained as
follows:

$y_3 = H_2 H_1 x_3 = H_2 H_1 (a x_1 + b x_2) = a H_2 H_1 x_1 + b H_2 y$    (8)

$y_3 = a[c, 0, \ldots, 0]^T + b[d_1, d_2, 0, \ldots, 0]^T = [n_1, n_2, 0, \ldots, 0]^T$    (9)

Equation (9) shows that all the components of $y_3$ are equal to zero, except the
first two. Since Householder matrices are orthogonal, the equality $\|y_3\|_2 = \|x_3\|_2$
holds. So, if the 2-norm of $x_3$ can be computed using only the first two components of
$y_3$, the linear dependency among $x_1$, $x_2$ and $x_3$ is verified.
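
The dependency test described above can be sketched as follows. This is a simplified pure-Python illustration under stated assumptions: the accumulated matrix $H$ is kept implicitly as a list of Householder vectors, the exact equality of norms is replaced by a relative tolerance `tol` on the trailing components of $y$, and all names are hypothetical.

```python
import math


def apply_reflections(vs, x):
    """Apply the stored reflections to x, oldest first (y = H x)."""
    y = list(x)
    for v in vs:
        vtv = sum(c * c for c in v)
        scale = 2.0 * sum(a * b for a, b in zip(v, y)) / vtv
        y = [yi - scale * vi for yi, vi in zip(y, v)]
    return y


def count_independent(vectors, tol=1e-9):
    """Count linearly independent vectors, processing them in order."""
    vs = []  # one Householder vector per independent vector found so far
    for x in vectors:
        y = apply_reflections(vs, x)
        l = len(vs)
        # ||y'||_2 == ||x||_2 exactly when the trailing part of y vanishes.
        tail_norm = math.sqrt(sum(c * c for c in y[l:]))
        x_norm = math.sqrt(sum(c * c for c in x))
        if tail_norm <= tol * max(x_norm, 1.0):
            continue  # x depends on the vectors seen before
        # New reflection acting on components l.., zeroing everything after y[l].
        v = [0.0] * l + y[l:]
        v[l] += tail_norm if y[l] >= 0 else -tail_norm
        vs.append(v)
    return len(vs)


x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [0.0, 1.0, 0.0, 1.0]
x3 = [2.0 * a - 3.0 * b for a, b in zip(x1, x2)]  # combination of x1 and x2
print(count_independent([x1, x2, x3]))
```

Each new reflection leaves the leading components untouched, so vectors already reduced to the form of equation (9) keep that form after later updates.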


3  The  Proposed  Optimizing  and  Training  Algorithm 

The optimization of the size of a feedforward neural network is a very important issue
in its design, since any network should have an optimum number of links and nodes
to be representative. This aim can be achieved by retaining only the most representative
nodes and deleting all the others. The selection process hinges upon the linear
dependency of the nodes. For instance, assume a feedforward neural network with
three nodes on its hidden layer, where the output value for the third hidden node is
linearly dependent on the output values of the rest of the hidden nodes for the set of
training patterns. In such case, the third hidden node could be eliminated because the
net inputs of the subsequent layer can be obtained using only the first two hidden
nodes.

The method we propose in this paper is based on the idea of determining, on each
iteration of the training process, the number of linearly independent outputs of the
hidden layer, say L, and then updating only the weights of the links connected with the
first L hidden nodes.

In order to compute the optimum number of hidden nodes using Householder
reflections, the N outputs of the concerned hidden layer have to be obtained for each
pattern of the set of training patterns. Thus, P N-dimensional vectors are formed
containing the outputs at the hidden layer for the input data set, where P represents
the total number of training patterns. After presentation of the first training pattern,
the first vector $x_1$ is obtained and the Householder reflection $H_1$ that zeroes the last N-1
components of this vector is computed. Next, for each pattern i, a new vector $x_i$ is
composed and its linear dependency on the i-1 vectors
computed in previous steps is checked.

Assume that L is the number of linearly independent vectors found at the i-th step.
This means that the Householder matrix $H$ computed in previous steps zeroes at least
the last N-L components of each vector of the set {$x_1, \ldots, x_{i-1}$}. If the new vector $x_i$ is
linearly dependent on {$x_1, \ldots, x_{i-1}$}, then the product $Hx_i$ must be a vector of the form:

$Hx_i = [n_1, \ldots, n_L, 0, \ldots, 0]^T$    (10)

However, if equation (10) does not hold, matrix $H$ has to be updated using a new
Householder reflection $H'$ to zero the last N-L-1 components of the vector obtained
in (10). Thus, matrix $H$ must be computed as the product $H'H$.


3.1  Our  Algorithm 

Consider  a  feedforward  neural  network  with  N  input  nodes,  M  hidden  nodes  and  O 
output  nodes  and  P  training  patterns.  Assume  that  L  is  the  optimum  number  of  hidden 
nodes computed at each iteration of the algorithm, with L equal to M at the
beginning of the first iteration. The proposed optimizing and training algorithm is as
follows:

1) Update the connection weights ($w_{ij}$) from the input layer to the hidden layer for
each of the P training patterns, using the back-propagation algorithm:

$w_{ij} = w_{ij} + \alpha \left( \sum_k \delta_{pk} w'_{jk} \right) f'(Net_{pj})\, x_{pi}, \quad 1 \le p \le P$    (11)

where $f$ is the activation function of each neuron j, $\alpha$ is a constant which determines
the learning rate, $x_{pi}$ is the i-th input of the pattern p, $\delta_{pk}$ is the error of the k-th output
node for the pattern p and $w'_{jk}$ is the connection weight from the j-th hidden node to
the k-th output node.

Since $w'_{jk}$ is zero for j greater than L, only the weights connected to the first L hidden
nodes will be updated.

2) Compute the number of non-redundant hidden nodes (L) and update the
connection weights from such nodes to the subsequent layer. At the beginning of this
step, L is equal to zero.

2.1) Obtain a vector $x_p$, formed by the hidden outputs for the concerned
training pattern (p). Next, a new vector $y$ is computed by the product $Hx_p$,
where the MxM matrix $H$ is the product of all the Householder reflections
computed at the previous p-1 iterations of this step. At iteration 1, $H$ is the
MxM identity matrix.

2.2) In order to check whether vector $x_p$ is linearly dependent on the vectors
($x_1, \ldots, x_{p-1}$) formed by the hidden outputs for the previous p-1 training
patterns, the following equation has to be verified:

$\|y'\|_2 = \|x_p\|_2$    (12)

where $y'$ is an L-dimensional vector composed of the first L components of
vector $y$.

If equation (12) holds, then $x_p$ is linearly dependent on ($x_1, \ldots, x_{p-1}$).

Otherwise, the optimum number of hidden nodes is increased (L = L+1) and
matrix $H$ is updated with the product $H'H$, where $H'$ is the Householder
matrix that zeroes the last M-L-1 components of vector $y$.

2.3) Update the connection weights ($w'_{kj}$) from the first L hidden nodes to the
subsequent layer:

$w'_{kj} = w'_{kj} + \alpha (d_{pj} - y_{pj}) f'(Net_{pj})\, y'_{pk}$    (13)

where $d_{pj}$ and $y_{pj}$ are the desired and obtained outputs of the j-th output node
for the pattern p, respectively, and $y'_{pk}$ is the output of the k-th hidden node
for such pattern.

2.4) Go to step 2.1 until connection weights of the concerned hidden layer
are updated for all the training patterns.

3) Go to step 1 until the network converges.

Remarks on the algorithm:

1. Network outputs are computed using only the optimum hidden nodes of the
previous algorithm iteration. So, at the beginning of the algorithm, network outputs
are obtained considering M hidden nodes.

2. At step 2.2, it is not necessary to calculate matrix H’ explicitly and then compute
the product H’H, since the structure of a Householder reflection can be applied
directly for updating a matrix.

3. The initial number of hidden units depends on the specific problem to solve.
However, it will always be less than or equal to the number of training patterns.

4. Once the network is trained, the last M-L nodes of the hidden layer can be
eliminated, since their weights to the subsequent layer are zero.

5.  In  case  of  a  network  with  more  than  one  hidden  layer,  once  the  weights  of  the  first 
hidden  layer  have  been  updated,  step  2  has  to  be  applied  again  for  the  subsequent 
hidden  layers. 


3.2  Comparison between the original back-propagation method and our
optimizing and training algorithm

In  order  to  show  the  performance  of  our  algorithm,  we  make  a  comparison  in  terms  of 
computational  cost  between  this  approach  and  the  original  back-propagation 
algorithm. 

The following table shows the differences in the number of operations between both
algorithms assuming an NxMxO neural network, P training patterns and L optimum
hidden nodes.

Table 1. Number of operations at different steps of both the original back-propagation
algorithm and the optimizing and training algorithm.

Step                                      Back-propagation    Optimizing and Training
                                          Algorithm           Algorithm

Compute Network Outputs                   PxNxM + PxMxO       PxNxM + PxLxO
Update Input Layer Weights                PxNxM               PxNxL
Compute Optimum Number of Hidden Nodes    -                   PxMxL
Compute Householder Reflections           -                   LxMx(M+1)
Update Hidden Layer Weights               PxMxO               PxLxO

Network outputs are obtained applying the following equations:

$y'_j = f\left( \sum_{i=1}^{N} w_{ij} x_i \right), \quad 1 \le j \le M$    (14)

$y_k = f\left( \sum_{j=1}^{M} w'_{jk} y'_j \right), \quad 1 \le k \le O$    (15)

where $y'_j$ are the outputs of the hidden layer, $y_k$ are the outputs of the output layer, $x_i$
are the network inputs and $w_{ij}$ and $w'_{jk}$ are the weights of input and hidden layers,
respectively.

In the original back-propagation algorithm, M hidden nodes are used to compute
the network outputs for each training pattern, so PxNxM + PxMxO operations are
required. In the optimizing and training algorithm, only L hidden nodes are needed to
compute the outputs of the last layer. However, the M outputs of the hidden layer have
to be obtained in order to check equation (12), so this step entails PxNxM + PxLxO
operations for the proposed algorithm.

To compute the optimum number of hidden nodes, it is necessary to verify
equation (12) at each process iteration. Vector $y'$ is obtained from the first L
components of the product $Hx_p$, so MxL operations are needed for each iteration at this
step. Since P is the number of iterations, the total number of operations is PxMxL.

If equation (12) does not hold, a new Householder reflection H’ is computed and
matrix H has to be updated with the product H’H. Instead of explicitly forming matrix
H’ and then computing H’H, which implies a matrix-matrix multiplication, the
structure of H’ can be applied directly using the equation:

$H'H = \left(I - \frac{2vv^T}{v^T v}\right)H = H - v\,\frac{2v^T H}{v^T v}$    (16)

where $v$ is the Householder vector for the matrix H’.

Thus, a Householder update of a matrix involves a matrix-vector multiplication
followed by an outer product update, which entails Mx(M+1) operations.

Since L is the optimum number of hidden nodes computed by the algorithm, L
Householder reflections are needed to check equation (12). Hence, the total number
of operations required on this step is LxMx(M+1).
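
The update in equation (16) can be sketched in a few lines. The following pure-Python function is an illustrative assumption about how the update might be coded (matrices as lists of rows, hypothetical names); it forms the row vector $2v^T H / v^T v$ once and then applies a rank-one correction, without ever building H’.

```python
def householder_update(h, v):
    """Return (I - 2 v v^T / (v^T v)) H as in equation (16).

    h is a square matrix stored as a list of rows; v is a nonzero vector
    of the same dimension.
    """
    m = len(h)
    vtv = sum(c * c for c in v)
    # Matrix-vector product: w = 2 v^T H / (v^T v), a row vector.
    w = [2.0 * sum(v[i] * h[i][j] for i in range(m)) / vtv for j in range(m)]
    # Outer product update: H - v w.
    return [[h[i][j] - v[i] * w[j] for j in range(m)] for i in range(m)]


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
v = [1.0, 1.0, 0.0]
reflected = householder_update(identity, v)   # the reflection matrix itself
restored = householder_update(reflected, v)   # applying it twice undoes it
print(reflected)
print(restored)
```

Applying the same reflection twice returns the original matrix, which gives a quick sanity check on any implementation of the update.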

It should be taken into account that the number of operations of table 1 for the
optimizing and training algorithm is an upper limit of the actual number of
operations. This is because the optimum number of hidden nodes, at any step of the
process, is always less than or equal to L. Moreover, the number of hidden units used on
each process iteration depends on the order of presentation of training patterns, so
establishing a general quantitative comparison between both algorithms is a difficult
task. This evaluation must be done for a specific network application.
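
To make the entries of table 1 concrete, the sketch below evaluates them for one network. The function and key names are hypothetical; the sample sizes (N=6, M=20, O=1, P=300, L=3) are taken from the simulation described in Section 4 and serve only as an illustration of the upper-bound counts, not as a measured result.

```python
def backprop_ops(n, m, o, p):
    """Operation counts per pass for the original algorithm (table 1)."""
    return {
        "compute_outputs": p * n * m + p * m * o,
        "update_input_weights": p * n * m,
        "update_hidden_weights": p * m * o,
    }


def optimizing_ops(n, m, o, p, l):
    """Upper-bound operation counts per pass for the proposed algorithm (table 1)."""
    return {
        "compute_outputs": p * n * m + p * l * o,
        "update_input_weights": p * n * l,
        "compute_optimum_nodes": p * m * l,
        "householder_reflections": l * m * (m + 1),
        "update_hidden_weights": p * l * o,
    }


bp = backprop_ops(6, 20, 1, 300)
opt = optimizing_ops(6, 20, 1, 300, 3)
print(sum(bp.values()), sum(opt.values()))
```

The extra cost of the rank test and the reflections is offset here by the smaller weight updates, but the balance depends on how L compares with M and on the relative sizes of N and O.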


4  Simulation  Results 

To  test  the  effectiveness  of  our  algorithm,  the  chaotic  time-series  generated  by  the 
Mackey-Glass  equation  have  been  studied  using  three-layer  feedforward  networks. 

A  system  is  said  to  be  chaotic  if  the  evolutionary  trajectory  of  the  system  is 
generated  by  a  deterministic  mechanism,  but  it  is  very  sensitive  to  the  system’s  initial 
condition  [8].  Since  under  certain  conditions  a  chaotic  system  behaves  randomly,  the 
identification  of  such  system  is  difficult.  Under  those  conditions,  a  model  capable  of 
identifying  the  underlying  deterministic  mechanism  can  greatly  improve  system 
performance,  predictability  and  control. 

The discrete time representation of the Mackey-Glass equation is given by

$x(k+1) - x(k) = \frac{\alpha x(k-\tau)}{1 + x^{\gamma}(k-\tau)} - \beta x(k)$    (17)

Consider the series generated with $\alpha = 0.2$, $\beta = 0.1$, $\gamma = 10$ and $\tau = 17$. This combination
generates a quasiperiodic time series, where a quasiperiodic process is a linear
combination of several periodic processes.

The objective is to model the Mackey-Glass series in order to predict future values.
The Mackey-Glass series {x(k)} can be expressed as

$x(k+p) = f(x(k), x(k-\tau), x(k-2\tau), \ldots, x(k-(N-1)\tau))$    (18)

where p is the prediction time, which is chosen according to the need for long-term or
short-term prediction, and N is generally between four and eight [8] [9]. We have
chosen N=6, so a six-input neural network is considered where $x(k), x(k-\tau), \ldots, x(k-5\tau)$
are used as the inputs and $x(k+p)$ is used as the output.
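
A minimal sketch of how such a data set might be generated from equations (17) and (18) is given below. The constant initial history `x0 = 1.2` and the prediction time `p = 1` are assumptions made for illustration (the text does not state them), and the function names are hypothetical.

```python
def mackey_glass(length, alpha=0.2, beta=0.1, gamma=10, tau=17, x0=1.2):
    """Iterate equation (17), starting from an assumed constant history x0."""
    x = [x0] * (tau + 1)
    while len(x) < length:
        delayed = x[-1 - tau]                       # x(k - tau)
        x.append(x[-1] + alpha * delayed / (1.0 + delayed ** gamma) - beta * x[-1])
    return x[:length]


def make_patterns(series, n_inputs=6, tau=17, p=1):
    """Build (inputs, target) pairs following equation (18)."""
    patterns = []
    first = (n_inputs - 1) * tau                    # earliest k with a full history
    for k in range(first, len(series) - p):
        inputs = [series[k - i * tau] for i in range(n_inputs)]
        patterns.append((inputs, series[k + p]))
    return patterns


series = mackey_glass(500)
patterns = make_patterns(series)
training_set = patterns[:300]                       # 300 data sets, as in the text
print(len(patterns), len(training_set))
```

Each pattern holds the six delayed samples as inputs and the sample p steps ahead as the target, matching the six-input, one-output network considered here.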

Simulation results have been obtained from several neural networks with different
numbers of hidden units using 300 data sets for training. For each of those neural
networks, both the back-propagation algorithm and the optimizing and training
algorithm have been applied. When the proposed method is applied, a reduced 6x3x1
network is obtained.

Fig. 1. Training length for several networks with different number of hidden nodes


Fig. 2. Iteration length for MG series using networks with different number of hidden units.

Figures 1 and 2 show the training and iteration lengths using different numbers of
hidden nodes in the proposed and the original back-propagation algorithms. As can
be seen, although training time increases for large networks in both algorithms, the
optimizing and training method provides better results than the back-propagation
algorithm, even when the optimum number of hidden nodes is close to the initial
number of hidden nodes.


Fig. 3. Mackey-Glass series modeled using 6x20x1 and 6x3x1 networks.

Fig. 4. Number of iterations required for the back-propagation algorithm and the proposed
method.

The representation, in figure 3, of the Mackey-Glass series modeled using a
6x20x1 network, trained with the original back-propagation algorithm, and a 6x3x1
network, obtained by means of the proposed method, shows that the performance of
both networks is equally good.

Figure 4 shows the number of iterations required to train the networks. From the
results obtained we can observe that small networks need fewer iterations
than large networks to reach a low mean-squared error (MSE). However, the learning
speed depends on many other factors such as weight initialization and the learning rate
parameter ($\alpha$).


5  Conclusions 

In this paper a method for training and reducing the size of feedforward neural
networks has been presented. The key idea of this approach consists of iteratively
computing the optimum hidden nodes and then updating only the weights connected
to those nodes. Using this method, the retraining process of the reduced-size network is
avoided.

We apply Householder reflections to compute the optimum network size on each
process iteration. These orthogonal transformations lead to a decorrelation of the
network information using few operations, which accelerates the training process.

From  experimental  results,  an  improvement  on  the  network  training  length  can  be 
observed  with  regards  to  the  original  back-propagation  algorithm  and  hence,  in 
relation  to  existing  pruning  approaches. 

The proposed algorithm can be expressed in basic matrix operations and so its
implementation can be easily achieved using processor arrays, such as systolic or
wavefront arrays.


Abstract. The Versatile Advection Code is a single scientific software
package designed and implemented to solve various hydrodynamic and
magnetohydrodynamic problems typical of astrophysical research. It runs
on workstations, and on vector and parallel supercomputers as well. The
versatility for applications is ensured by the Loop Annotation Syntax
preprocessor and the modular design of the software, while portability
to different hardware platforms is achieved by the preprocessors that can
translate the code from Fortran 90 both to High Performance Fortran
and Fortran 77. Performance results are presented for several platforms.


1  Introduction 

The  Versatile  Advection  Code  (VAC)  [1,2]  has  been  developed  since  1994  as  a 
general  purpose  tool  for  hydrodynamic  and  magnetohydrodynamic  astrophysical 
applications.  VAC  uses  various  shock  capturing  numerical  methods  [3],  explicit, 
semi-implicit,  or  fully  implicit  time  stepping  [4,5]  on  1,  2,  or  3  dimensional 
finite volume grids. The software package comes complete with a 120-page manual
written in hypertext, a user interface based on web browsers, and visualization
macros for the most popular visualization software packages. The ever growing number of
users  and  applications  proves  that  the  concept  of  a  single  well  designed  general 
purpose  scientific  software  package  is  a  good  alternative  to  the  typical  specialized 
scientific  codes. 

The  most  original  software  solution  in  VAC  is  the  Loop  Annotation  Syntax 
(LASY)  [6],  which  was  developed  to  provide  a  compact  notation  for  expressions 
occurring in a multidimensional hydrodynamic code independent of the number
of  represented  spatial  dimensions.  The  other  important  feature  is  the  modular 
design,  which  allows  VAC  to  solve  different  equations  with  different  methods,  and 
lets  the  user  add  extra  terms  in  the  equations,  define  special  initial  and  boundary 
conditions,  or  specify  non-default  input/output  data  format  by  writing  a  few  well 
specified  subroutines. 

VAC was designed from the beginning to run both on workstations, where most
scientists do their simulations, and on the vector and parallel supercomputers required
for big 2D and 3D simulations. The source code, after it is translated




FEUP  -  Faculdade  de  Engenharia  da  Universidade  do  Porto 


from the LASY notation, uses Fortran 90 (F90) array syntax and High Performance
Fortran (HPF) style FORALL statements for all the expressions that
operate on the whole computational grid. Thus it is easy to add HPF compiler
directives in an automated fashion and run the code on a parallel machine under
HPF. It is also trivial to translate the FORALL statements back to ordinary DO
loops for a Fortran 90 compiler on a non-parallel machine.

Although Fortran 90 is becoming available at most scientific computing
facilities, it is still necessary to be able to translate the source code to Fortran 77
(F77). A simple translator program carries out this task for
the limited number of language constructs that are used from the rich Fortran
90 language. Not using all the features of F90 is a restriction for the developer,
but it is beneficial for the users, who are more familiar with the simpler F77
language, and for the compilers, which usually do a better job on simpler program
constructs.

2  Preprocessors 

The use of the preprocessors can be best demonstrated on a small piece of code.
The purpose of the gradient subroutine is simple: calculate the gradient gradq
of the quantity q in direction idir within a rectangle defined by the ix^L indices.
From the actual, more general, subroutine used in VAC, I extracted the part
which is valid for Cartesian grids and uses central differences. The subroutine is
shown in Figure 1.


subroutine gradient(q,ix^L,idir,gradq)

include 'vacdef.f90'

double precision:: q(ixG^T),gradq(ixG^T)
integer:: ix^L,idir,jx^L,hx^L

!SHIFT
jx^L=ix^L+kr(idir,^D);
!SHIFT MORE
hx^L=ix^L-kr(idir,^D);
!SHIFT BEGIN
gradq(ix^S)=0.5D0*(q(jx^S)-q(hx^S))/dx(ix^S,idir)
!SHIFT END

return
end


Fig.  1.  Example  source  code  with  LASY. 


The included vacdef.f90 file declares the global parameters and variables.
The array dimensions ixG^T, the grid spacing dx(ixG^T,ndim), and the Kronecker
delta array kr(3,3), which is used to shift indices in a certain direction,
are all declared and initialized before this subroutine is called. The meaning of
the LASY patterns starting with the special character ^ is briefly the following:
^D stands for dimensions, ^L for limits, ^S for array segments, and ^T for the total
size of arrays. The VAC Preprocessor (VACPP) substitutes the patterns with
substitute strings, whose number depends on the number of spatial dimensions,
which is a parameter for VACPP. The preprocessor not only replaces the patterns
with their substitute strings, but it also repeats the source code attached
to the pattern, and the repetitions are separated appropriately. The detailed
rules of LASY are described in [6]; here I simply show the code translated to 2
dimensions in Figure 2.


subroutine gradient(q,ixmin1,ixmin2,ixmax1,ixmax2,idir,gradq)

include 'vacdef.f90'

integer:: ixmin1,ixmin2,ixmax1,ixmax2,idir,&
   jxmin1,jxmin2,jxmax1,jxmax2,hxmin1,hxmin2,hxmax1,hxmax2
double precision:: q(ixGlo1:ixGhi1,ixGlo2:ixGhi2),&
   gradq(ixGlo1:ixGhi1,ixGlo2:ixGhi2)

!SHIFT
jxmin1=ixmin1+kr(idir,1);jxmin2=ixmin2+kr(idir,2);
jxmax1=ixmax1+kr(idir,1);jxmax2=ixmax2+kr(idir,2);
!SHIFT MORE
hxmin1=ixmin1-kr(idir,1);hxmin2=ixmin2-kr(idir,2);
hxmax1=ixmax1-kr(idir,1);hxmax2=ixmax2-kr(idir,2);
!SHIFT BEGIN
gradq(ixmin1:ixmax1,ixmin2:ixmax2)=0.5D0*&
   (q(jxmin1:jxmax1,jxmin2:jxmax2)&
   -q(hxmin1:hxmax1,hxmin2:hxmax2))&
   /dx(ixmin1:ixmax1,ixmin2:ixmax2,idir)
!SHIFT END

return
end


Fig.  2.  Source  code  translated  to  Fortran  90  for  2  spatial  dimensions. 


It  is  quite  easy  to  imagine  what  the  1  or  3  dimensional  versions  would  look 
like.  Clearly,  the  LASY  notation  is  not  only  more  general,  but  also  more  compact 
than  the  translated  F90  source  code.  The  VACPP  preprocessor  is  implemented 
as  the  vacpp.pl  Perl  script. 
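
To make the substitution step concrete, the following sketch mimics only the simplest part of what VACPP does: replacing the ^L, ^S and ^T patterns inside an argument or index list. It is written in Python rather than Perl or Fortran, and it is not the vacpp.pl script. The function name expand_lasy and the regular expressions are assumptions made for illustration; the substitute strings are the ones visible in Figures 1 and 2, and the repetition of whole statements (needed for the ^D pattern) is deliberately left out.

```python
import re

def expand_lasy(line, ndim):
    """Toy expansion of the ^L, ^S and ^T patterns.

    Hypothetical helper for illustration only; the real rules are in [6].
    """
    dims = range(1, ndim + 1)

    def limits(m):    # ix^L  -> ixmin1,ixmin2,ixmax1,ixmax2
        name = m.group(1)
        return ",".join([f"{name}min{d}" for d in dims] +
                        [f"{name}max{d}" for d in dims])

    def segment(m):   # ix^S  -> ixmin1:ixmax1,ixmin2:ixmax2
        name = m.group(1)
        return ",".join(f"{name}min{d}:{name}max{d}" for d in dims)

    def total(m):     # ixG^T -> ixGlo1:ixGhi1,ixGlo2:ixGhi2
        name = m.group(1)
        return ",".join(f"{name}lo{d}:{name}hi{d}" for d in dims)

    line = re.sub(r"(\w+)\^L", limits, line)
    line = re.sub(r"(\w+)\^S", segment, line)
    line = re.sub(r"(\w+)\^T", total, line)
    return line
```

Calling expand_lasy on the subroutine header of Figure 1 with ndim set to 2 reproduces the header of Figure 2; changing ndim is all that is needed to obtain the header for another dimensionality.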

In case the user has no F90 compiler available, the Fortran 90 source is
further translated to Fortran 77 by the f90tof77 Perl script. The translation
changes the free source format to the fixed one, and replaces the array syntax by
do loops. The f90tof77 script can also deal with the differences between F90

      subroutine gradient(q,ixmin1,ixmin2,ixmax1,ixmax2,idir,gradq)

      include 'vacdef.f'

      integer ixmin1,ixmin2,ixmax1,ixmax2,idir,
     &   jxmin1,jxmin2,jxmax1,jxmax2,hxmin1,hxmin2,hxmax1,hxmax2
      double precision q(ixGlo1:ixGhi1,ixGlo2:ixGhi2),
     &   gradq(ixGlo1:ixGhi1,ixGlo2:ixGhi2)

*SHIFT
      jxmin1=ixmin1+kr(idir,1)
      jxmin2=ixmin2+kr(idir,2)
      jxmax1=ixmax1+kr(idir,1)
      jxmax2=ixmax2+kr(idir,2)
*SHIFT MORE
      hxmin1=ixmin1-kr(idir,1)
      hxmin2=ixmin2-kr(idir,2)
      hxmax1=ixmax1-kr(idir,1)
      hxmax2=ixmax2-kr(idir,2)
*SHIFT BEGIN
      do ix_2=ixmin2,ixmax2
         do ix_1=ixmin1,ixmax1
            gradq(ix_1,ix_2)=0.5D0*
     &        (q(ix_1+(jxmin1-ixmin1),ix_2+(jxmin2-ixmin2))
     &        -q(ix_1+(hxmin1-ixmin1),ix_2+(hxmin2-ixmin2)))
     &        /dx(ix_1,ix_2,idir)
         enddo
      enddo
*SHIFT END

      return
      end

Fig.  3.  Source  code  further  translated  to  Fortran  77. 




VECPAR'98  ■  3rd  International  Meeting  on  Vector  and  Parallel  Processing 


and F77 regarding the variable declaration, and it can translate some functions
like sum, product, maxval, minval, any, all, which operate on arrays and
return scalars. The where, forall, and case constructs can also be translated.
Other features of Fortran 90, like dynamic allocation, modules, array valued
functions, pointers, structures, etc., are not used in VAC, and cannot be translated
by f90tof77, which is a short and simple program. The gradient subroutine
in 2 dimensions and in F77 is shown in Figure 3. The loop variables ix_1, ix_2
are declared in the included file.
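
The index arithmetic of the loop version in Figure 3 can be exercised on a small test field. The sketch below is an illustration in Python rather than Fortran, using nested lists. The function names, the 0-based indexing, and the single uniform grid spacing dx (instead of the dx array of VAC) are assumptions and not part of the code.

```python
def kr(i, j):
    """Kronecker delta, used to shift indices in direction i."""
    return 1 if i == j else 0

def gradient_loops(q, ixmin, ixmax, idir, dx):
    """Central difference of q in direction idir (1 or 2) on the index
    rectangle ixmin..ixmax (inclusive), written with explicit loops."""
    (ixmin1, ixmin2), (ixmax1, ixmax2) = ixmin, ixmax
    # Shifted lower limits, as in the SHIFT and SHIFT MORE statements.
    jxmin1, jxmin2 = ixmin1 + kr(idir, 1), ixmin2 + kr(idir, 2)
    hxmin1, hxmin2 = ixmin1 - kr(idir, 1), ixmin2 - kr(idir, 2)
    gradq = {}
    for ix2 in range(ixmin2, ixmax2 + 1):
        for ix1 in range(ixmin1, ixmax1 + 1):
            # Offsets (jxmin-ixmin) and (hxmin-ixmin) select the two neighbours.
            qj = q[ix1 + (jxmin1 - ixmin1)][ix2 + (jxmin2 - ixmin2)]
            qh = q[ix1 + (hxmin1 - ixmin1)][ix2 + (hxmin2 - ixmin2)]
            gradq[(ix1, ix2)] = 0.5 * (qj - qh) / dx
    return gradq
```

For a linear field such as q(i,j) = 3i + 5j with unit spacing, the routine returns 3 everywhere for idir = 1 and 5 everywhere for idir = 2, which is a quick check that the shifted indices point to the intended neighbours.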


subroutine gradient(q,ixmin1,ixmin2,ixmax1,ixmax2,idir,gradq)

include 'vacdef.hpf'

integer:: ixmin1,ixmin2,ixmax1,ixmax2,idir,&
   jxmin1,jxmin2,jxmax1,jxmax2,hxmin1,hxmin2,hxmax1,hxmax2
double precision:: q(ixGlo1:ixGhi1,ixGlo2:ixGhi2),&
   gradq(ixGlo1:ixGhi1,ixGlo2:ixGhi2)

!HPF$ DISTRIBUTE q(BLOCK,*) ONTO PP
!HPF$ DISTRIBUTE gradq(BLOCK,*) ONTO PP

!SHIFT
jxmin1=ixmin1+kr(idir,1);jxmin2=ixmin2+kr(idir,2);
jxmax1=ixmax1+kr(idir,1);jxmax2=ixmax2+kr(idir,2);
!SHIFT MORE
hxmin1=ixmin1-kr(idir,1);hxmin2=ixmin2-kr(idir,2);
hxmax1=ixmax1-kr(idir,1);hxmax2=ixmax2-kr(idir,2);
!SHIFT BEGIN
IF(hxmin1==ixmin1-1.and.hxmin2==ixmin2.and.&
   jxmin1==ixmin1+1.and.jxmin2==ixmin2)THEN
   gradq(ixmin1:ixmax1,ixmin2:ixmax2)=0.5D0*&
      (q(ixmin1+1:ixmax1+1,ixmin2:ixmax2)&
      -q(ixmin1-1:ixmax1-1,ixmin2:ixmax2))&
      /dx(ixmin1:ixmax1,ixmin2:ixmax2,idir)
ELSE IF(hxmin1==ixmin1.and.hxmin2==ixmin2-1.and.&
   jxmin1==ixmin1.and.jxmin2==ixmin2+1)THEN
   gradq(ixmin1:ixmax1,ixmin2:ixmax2)=0.5D0*&
      (q(ixmin1:ixmax1,ixmin2+1:ixmax2+1)&
      -q(ixmin1:ixmax1,ixmin2-1:ixmax2-1))&
      /dx(ixmin1:ixmax1,ixmin2:ixmax2,idir)
ELSE
   stop 'SHIFT did not optimize!'
ENDIF
!SHIFT END

return
end


Fig.  4.  Source  code  with  HPF  directives  and  optimized  index  shifts. 

The f90tohpf script inserts the HPF directives into the Fortran 90 source
code automatically. All arrays defined on the full grid are declared with the
ixGlo1:ixGhi1,... index limits, and they can be distributed among the processors
according to the parameters given to f90tohpf. On different parallel architectures
and/or for different problem sizes, different distributions may be
optimal. The automatic insertion of the directives makes it extremely simple to,
e.g., change a (BLOCK,BLOCK) distribution to (BLOCK,*) or (*,BLOCK).

Unfortunately, HPF compilers are not as mature as F77 or F90 compilers.
Several HPF compiler bugs were found while VAC was tested on parallel computers.
Due to the simplicity of the source code, there were relatively few problems,
and they could be avoided relatively easily. Even if the code compiles and runs
correctly, the performance can be very poor if the HPF compiler does not recognize
the simple shift operations in the gradient subroutine and elsewhere
in the source. General global communication is much slower than the fast
specialized shifts, which are supported by the hardware and the communication
libraries of most parallel computers. To help the compiler, the VAC preprocessor
can replace the general shift statement marked with the !SHIFT comments
with shifts in specific directions placed in the appropriate branches of an if,
else if construct. The resulting code, shown in Figure 4, is longer and more
difficult to read, but it usually compiles to faster code under HPF. The physical
layout of processors PP is defined in the include file. When only one spatial
dimension is distributed, one can use the HPF directive !HPF$ PROCESSORS
PP(NUMBER_OF_PROCESSORS()).

The code can also be translated to Connection Machine Fortran (CM Fortran)
with the f90tocmf script. Unfortunately, the CM Fortran compiler recognizes
index shifts only for a very limited type of syntax, thus communication is not optimal
without rewriting the critical shifts by hand. In principle, one could automate
this optimization, but, since CM Fortran is disappearing from the scene, there
is little motivation to write the necessary Perl script.

3  Results  and  Conclusions 

VAC is being used by approximately 25 researchers, mostly astrophysicists. Most
applications are hydrodynamic and magnetohydrodynamic simulations, but VAC
is also used as a test suite for different numerical methods. Most users have access
to powerful workstations, thus the code has been tested and used on DEC, SUN,
IBM, SGI, and HP workstations, and even on Pentium PCs under Linux.

Due to the simplicity of the loops, which is implied by the F90 array syntax,
the code vectorizes very well. On a single node of the traditional vector supercomputer
Cray C90, VAC runs about 23 times faster than on a DEC Alpha/400
workstation, while the ratio is 4.2 for the J90. These measurements were done
for a specific problem [7], but the speed ratios are typical for all timings tried so
far.

VAC has also been tested on the IBM SP, Cray T3E, Cray T3D, and Connection
Machine 5 (CM5) parallel machines, and on a cluster of workstations,
under different HPF compilers [8,7]. The scaling is close to linear up to 8 processors
on the IBM SP and on the Cray T3E for a rather moderate and fixed
problem size, which shows that good scaling is possible under HPF even for a
code as complex as VAC. The single node performance is a factor of 5.2 and
1.7 improvement relative to the DEC Alpha/400 workstation for the SP and the
T3E machines, respectively. On a 16-node CM5, after optimizing the array shift
operations by hand, the code runs about 15 times faster than on the DEC Alpha.
On the cluster of workstations the code compiled and ran
successfully, but the multiuser environment did not allow for meaningful timing.

The Versatile Advection Code demonstrates that it is possible to write one source
code for several different applications and computer platforms with the aid of
simple but powerful preprocessor and translator programs. All the preprocessor
programs, vacpp.pl, f90tof77, f90tohpf, f90tocmf, and forall2do, are implemented
in Perl, which is free software and is installed on almost all scientific
computers. The preprocessing step and the final compilation can even
be done on different computers if necessary.

Currently we are working on the HPF compatible implementation of the
implicit time stepping module. As a first step, the Poisson solver using Conjugate
Gradient type iterative schemes (CG and BiCGSTAB), originally implemented
in F77, has been rewritten in the LASY notation, and it now runs successfully
on parallel machines with HPF. The next step involves rewriting and testing
the preconditioner [9] for the block penta- and heptadiagonal Jacobian matrices
that arise in implicit time stepping schemes.

Abstract. The study of the astrophysical N-body problem requires the
use of numerical integration to solve a system of 6N first-order differential
equations. The particle-particle (PP) codes using direct summation
methods are a good example of algorithms where parallelization can
speed up the computation in an efficient way. For this purpose, a serial
version of the PP code NNEWTON developed by the author was parallelized
using the MPI library and tested on the CRAY-T3D at the
EPCC. The results of the parallel code presented here show very good
efficiency and scaling, up to 128 processors and for systems of up to 16384
particles.


1  Introduction 

We begin with an introduction to the astrophysical N-body problem and the mathematical
models used in our work. We also present an overview of particle simulation
methods, and discuss the implementation of a direct summation method:
the PP algorithm. A parallel version of this algorithm and its performance
analysis are then presented. Finally, conclusions drawn from the discussion of the
results are offered.

2  The  Astrophysical  N-Body  Problem 

The gravitational N-body problem refers to a system of bodies interacting
through their mutual gravitational attraction, confined to a delimited region
of space. In the universe we can select systems of bodies according to the observation
scale. For instance, we can consider the Solar System with N = 10
(a restricted model: Sun + 9 planets). Increasing the observation scale, we have
systems like open clusters (systems of young stars with typical ages of the order
of 10® years, and N ~ 10^ - 10®), globular clusters (systems of old stars with
ages of 12-15 billion years, extremely compact and spherically symmetric, with
N ~ 10^ - 10®), and galaxies (N ~ 10^° - 10^®). At the other extreme of our
scale, on a cosmological scale, we have clusters of galaxies and superclusters.
If we want to consider the whole universe, the total number of galaxies in the
observable part is estimated to be of the order of 10® (see [2], [18], and [9]).

*  This  work  was  supported  by  EPCC/TRACS  under  Grant  ERB-FMGE-CT95-0051 
and  partly  supported  by  PRAXIS  XXI  under  GRANT  BM/594/94. 

In our work we are interested in the dynamics of systems with N up to the
order of $10^4$ (open clusters and small globular clusters).


2.1  The  Mathematical  Model 

In our mathematical model of the physical system each body is considered as a
mass point (hereafter referred to as a particle) characterized by a mass, a position,
and a velocity. We also define an inertial Cartesian coordinate system, suitably
chosen in three-dimensional Euclidean space, and an independent variable $t$, the
absolute time of Newtonian mechanics.

The state of the system is defined by the set $S_N$ of $3N$ parameters: the
masses, positions, and velocities of all particles. Hence:

$$S_N = \{(m_i, \mathbf{r}_i, \dot{\mathbf{r}}_i),\ i = 1, \dots, N\} \tag{1}$$

where $\mathbf{r}_i$ and $\dot{\mathbf{r}}_i$ are the position and velocity vectors of particle $i$, respectively.


Comments. The physical state of the system can be represented as a point in a
$6N$-dimensional phase-space with coordinates $(\mathbf{r}_1, \dots, \mathbf{r}_N, \dot{\mathbf{r}}_1, \dots, \dot{\mathbf{r}}_N)$ (see [3]).
However, we will use the representation (1) of the system, which is more suitable for
the discussion of the parallelization of the N-body integrator in Sect. 3.3.

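The force exerted by particle $j$ on particle $i$ is given by Newton's Law of
Gravity:

$$\mathbf{F}_{ij} = -G m_i m_j \frac{\mathbf{r}_i - \mathbf{r}_j}{\|\mathbf{r}_i - \mathbf{r}_j\|^3} \tag{2}$$

and the total force acting on particle $i$ is

$$\mathbf{F}_i = \sum_{j=1,\ j \neq i}^{N} \mathbf{F}_{ij}. \tag{3}$$

The right-hand side of equation (3) represents the contribution of the other $N-1$
particles to the total force.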

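We can now write the equations of motion of particle $i$:

$$m_i \ddot{\mathbf{r}}_i = \mathbf{F}_i. \tag{4}$$

Defining $\mathbf{v}_i = \dot{\mathbf{r}}_i$ we can write the system of $6N$ first-order differential equations:

$$\dot{\mathbf{r}}_i = \mathbf{v}_i, \qquad \dot{\mathbf{v}}_i = \frac{1}{m_i}\,\mathbf{F}_i \tag{5}$$

with $i = 1, \dots, N$. The evolution of the N-body system is determined by the
solution of this system of differential equations with initial conditions (1).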

For systems with N = 2, the two-body problem known as the Kepler problem
(e.g. the Earth-Moon system), the equations of motion (5) can be solved
analytically. However, for the general N-body problem with N > 2 that is not the case
(see [3]), and we must use numerical methods to solve the system of differential
equations. In Sect. 3 we will discuss the problem of numerical integration of
N-body systems.

In every mathematical model of a physical system there is always the problem
of the validity of the model, that is, how suitable the model is to describe
the physics of the system. In our case we are representing bodies with finite
and, in general, different sizes by material points: bodies endowed with mass,
but no extension. The physics of the interior of the bodies is not taken into
account. However, for dynamical studies this model has proven to be suitable,
and has been used to study the evolution of clusters of stars, galaxies, and the
development of structures in single galaxies (see [9]).


2.2  Exponential  Instabilities  in  N-body  Systems 


The initial motivation of this work was the study of the exponential instability
in self-gravitating N-body systems (see [16]). In this problem we are interested
in the growth of a perturbation in one or more components of the system. For a
given system of N particles we consider the set

$$S_N^0 = \{(m_i, \mathbf{r}_i^0, \dot{\mathbf{r}}_i^0),\ i = 1, \dots, N\} \tag{6}$$

of initial conditions (at time $t = t_0$), and define the set of perturbed initial
conditions:

$$\Delta S_N^0 = \{(m_i, \Delta\mathbf{r}_i^0, \Delta\dot{\mathbf{r}}_i^0),\ i = 1, \dots, N\} \tag{7}$$

where $\Delta\mathbf{r}_i^0$ and $\Delta\dot{\mathbf{r}}_i^0$ are the position and the velocity perturbation vectors for
the initial conditions. To evaluate the growth of the perturbations we must solve
the system of $3N$ second-order differential equations (see [6] and [16]):

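$$\Delta\ddot{\mathbf{r}}_i = -G \sum_{j=1,\ j \neq i}^{N} \mathbf{f}(\Delta\mathbf{r}_i, \Delta\mathbf{r}_j, \mathbf{r}_i, \mathbf{r}_j)\, \frac{m_j}{\|\mathbf{r}_i - \mathbf{r}_j\|^3} \tag{8}$$

with $i = 1, \dots, N$, and

$$\mathbf{f}(\Delta\mathbf{r}_i, \Delta\mathbf{r}_j, \mathbf{r}_i, \mathbf{r}_j) = \Delta\mathbf{r}_i - \Delta\mathbf{r}_j - 3\, \frac{(\Delta\mathbf{r}_i - \Delta\mathbf{r}_j) \cdot (\mathbf{r}_i - \mathbf{r}_j)}{\|\mathbf{r}_i - \mathbf{r}_j\|^2}\, (\mathbf{r}_i - \mathbf{r}_j). \tag{9}$$

Defining $\Delta\mathbf{v}_i = \Delta\dot{\mathbf{r}}_i$ we can rewrite (8) in the form:

$$\Delta\dot{\mathbf{r}}_i = \Delta\mathbf{v}_i, \qquad \Delta\dot{\mathbf{v}}_i = -G \sum_{j=1,\ j \neq i}^{N} \mathbf{f}(\Delta\mathbf{r}_i, \Delta\mathbf{r}_j, \mathbf{r}_i, \mathbf{r}_j)\, \frac{m_j}{\|\mathbf{r}_i - \mathbf{r}_j\|^3} \tag{10}$$

with $i = 1, \dots, N$, $\Delta\mathbf{r}_i = (\Delta x_i, \Delta y_i, \Delta z_i)$, and $\Delta\mathbf{v}_i = (\Delta\dot{x}_i, \Delta\dot{y}_i, \Delta\dot{z}_i)$. This
system of $6N$ first-order differential equations, the variational equations, must
be solved together with equations (5).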

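We now define several metrics as functions of the components of the perturbation
vectors (see [6] and [16]):

$$\Delta R = \max_{i=1,\dots,N} \left( |\Delta x_i| + |\Delta y_i| + |\Delta z_i| \right) \tag{11}$$

$$\langle \Delta R \rangle = \frac{1}{N} \sum_{i=1}^{N} \left( |\Delta x_i| + |\Delta y_i| + |\Delta z_i| \right) \tag{12}$$

for the perturbations in the position vectors, and

$$\Delta V = \max_{i=1,\dots,N} \left( |\Delta\dot{x}_i| + |\Delta\dot{y}_i| + |\Delta\dot{z}_i| \right) \tag{13}$$

$$\langle \Delta V \rangle = \frac{1}{N} \sum_{i=1}^{N} \left( |\Delta\dot{x}_i| + |\Delta\dot{y}_i| + |\Delta\dot{z}_i| \right) \tag{14}$$

for the perturbations in the velocity vectors. Each metric is evaluated at each
time step of the numerical integration of equations (5) and (10).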
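
The maximum and mean metrics of equations (11) to (14) share one form, so a single routine can serve for both positions and velocities. The following Python sketch is only an illustration (the original codes are written in FORTRAN 77); the function name and the list-of-tuples data layout are assumptions.

```python
def perturbation_metrics(deltas):
    """Return (maximum, mean) of |dx| + |dy| + |dz| over all particles.

    deltas: list of (dx, dy, dz) perturbation components, one per particle.
    Passing position perturbations gives the metrics of (11) and (12);
    passing velocity perturbations gives those of (13) and (14).
    """
    norms = [abs(dx) + abs(dy) + abs(dz) for dx, dy, dz in deltas]
    return max(norms), sum(norms) / len(norms)
```

In an integrator, the routine would be called twice per time step, once with the position perturbations and once with the velocity perturbations.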

The  analysis  of  the  quantities  given  by  equations  (11),  (12),  (13),  and  (14) 
is very important for understanding some aspects of the dynamical behavior of
N-body systems (see [8], [10], [11], and [13]). In particular, we are interested in
the  relation  between  collisions  and  the  growth  of  perturbations.  The  collisions 
between  bodies  are  an  important  mechanism  in  the  evolution  of  systems  like 
open  clusters  and  globular  clusters  (see  [2]  and  [9]). 

3  Numerical  Simulation  of  N-Body  Systems 

In  this  section,  we  will  briefly  discuss  the  use  of  particle  methods  to  solve  the 
N-body  problem  with  special  attention  to  the  direct  summation  method:  the  PP 
method  (see  [9],  for  an  excellent  and  detailed  presentation  of  these  methods). 
We  present  a  serial  version  of  the  PP  method  and  discuss  a  parallel  version  of 
that  method. 

3.1  Overview  of  Particle  Simulation  Methods 

Particle  methods  is  the  designation  of  a  class  of  simulation  methods  in  which  the 
physical  phenomena  are  represented  by  particles  with  certain  attributes  (such 
as  mass,  position,  and  velocity),  interacting  according  to  some  physical  law  that 
determines  the  evolution  of  the  system.  In  most  cases  we  can  establish  a  direct 
relation  between  the  computational  particles  and  the  physical  particles.  In  our 
work each computational particle is the numerical representation of one physical
particle. However, in simulations of physical systems with large N, such as
galaxies  of  10^^  to  10^^  stars,  each  computational  particle  is  a  superparticle  with 
the  mass  of  approximately  10®  stars. 

We will now discuss the three principal types of particle simulation methods:
a direct summation method, a particle-in-cell (PIC) method, and a hybrid
method.

The Particle-Particle Method (PP). This is a direct summation method:
the total force on a particle is the sum of its interactions with all the other
particles of the system. To determine the evolution of an N-body system we
consider the interaction of every pair of particles, that is, $N(N-1)$ pairs $(i, j)$,
with $i, j = 1, \dots, N$ and $i \neq j$. The numerical effort (number of floating-point
operations) is observed to be proportional to $N^2$.
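
The quadratic cost of direct summation is visible in the loop structure itself. The sketch below is a Python illustration, not the NNEWTON code: it evaluates the Newtonian accelerations that follow from equation (2) and counts the ordered pairs visited. The function name, the data layout, and the default G = 1 are assumptions.

```python
import math

def accelerations(masses, positions, G=1.0):
    """Direct (particle-particle) summation of Newtonian accelerations.

    Visits all N(N-1) ordered pairs (i, j) with i != j and returns the
    accelerations together with the number of pairs evaluated.
    """
    n = len(masses)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    pairs = 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = [positions[i][k] - positions[j][k] for k in range(3)]
            r3 = math.sqrt(sum(c * c for c in d)) ** 3
            for k in range(3):
                # a_i = -G * m_j * (r_i - r_j) / |r_i - r_j|^3, summed over j
                acc[i][k] -= G * masses[j] * d[k] / r3
            pairs += 1
    return acc, pairs
```

Doubling the number of particles roughly quadruples the number of pairs, which is the $N^2$ behaviour described above. The sketch does not exploit the symmetry between the pairs $(i, j)$ and $(j, i)$.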

The Particle-Mesh Method (PM). This is a particle-in-cell method: the
physical space is discretized by a regular mesh where a density function is defined
according to the attributes of the particles (e.g. mass density for a self-gravitating
N-body system). A Poisson equation is solved on the mesh, and the forces at particle
positions are then determined by interpolation on the array of mesh-defined
values. The numerical effort is observed to be proportional to N. The gain in
speed is obtained at the cost of a loss of spatial resolution. This is particularly
important for the simulation of N-body systems if we are interested in exact
orbits.


The Particle-Particle-Particle-Mesh Method (P$^3$M). This is a hybrid
method: the interaction between one particle and the rest of the system is determined
by considering a short-range contribution (evaluated by the PP method)
and a long-range contribution (evaluated by the PM method). The numerical
effort is also observed to be proportional to N, as in the PM method. The advantage
of this method over the PM method is that it can represent close encounters
as accurately as the PP method. On the other hand, the P$^3$M method calculates
long-range forces as fast as the PM method.

Comments. We choose the method according to the physics of the system
under investigation. For our work we use the PP method: we are interested in
simulating clusters of stars where collisions are important and, therefore, spatial
resolution is important. In addition, for the values of N used in some of
our simulations (N ~ 16 - 1024) a direct summation method has the
advantage of providing forces that are as accurate as the arithmetic precision of
the computer.


3.2  The  PP  Serial  Algorithm 

In our previous work (see [16]) we implemented the PP method in
FORTRAN 77. Several programs were written (the NNEWTON codes) but only
two versions are considered here: a PP integrator of the equations of motion,
and a PP integrator of the equations of motion + variational equations. These
two versions use a softened point-mass potential, that is, the force of interaction
between two particles $i$ and $j$ is defined as (see [1], [2], and [9]):

$$\mathbf{F}_{ij} = -G m_i m_j \frac{\mathbf{r}_i - \mathbf{r}_j}{\left( \|\mathbf{r}_i - \mathbf{r}_j\|^2 + \epsilon^2 \right)^{3/2}}. \tag{15}$$

The parameter $\epsilon$ is often called the softening parameter and is introduced to
avoid numerical problems during the integration of close encounters between
particles: as the distance between particles becomes smaller the force changes
as $1/\|\mathbf{r}_i - \mathbf{r}_j\|^2$ in equation (2), and extremely small time steps must be used
in order to control the local truncation error of the numerical integrator. The
softening parameter prevents the force from going to infinity at zero distance,
which would cause overflow errors.
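
The effect of the softening parameter can be checked numerically. The Python sketch below evaluates the softened force of equation (15); it is an illustration only, and the function name, the argument order, and the default G = 1 are assumptions rather than part of the NNEWTON codes.

```python
def softened_force(mi, mj, ri, rj, eps, G=1.0):
    """Force exerted by particle j on particle i with softening eps."""
    d = [a - b for a, b in zip(ri, rj)]
    # (|r_i - r_j|^2 + eps^2)^(3/2) never vanishes when eps > 0.
    denom = (sum(c * c for c in d) + eps * eps) ** 1.5
    return [-G * mi * mj * c / denom for c in d]
```

With eps > 0 the force is zero for coincident particles instead of diverging, and for unit masses and G = 1 its magnitude at separation $r$ is $r/(r^2 + \epsilon^2)^{3/2}$, which is bounded by $2/(3\sqrt{3}\,\epsilon^2)$ (attained at $r = \epsilon/\sqrt{2}$). Setting eps = 0 recovers the unsoftened force of equation (2) for distinct positions.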

3.3 The PP Parallel Algorithm (P-PP)

The PP method has been used to implement parallel versions of N-body integrators
by several authors (see [14], for instance). With this in mind, our first
goal was to write a simple algorithm with good load balance: each processor
should perform the same amount of computation. In addition, the algorithm
should be able to take advantage of an increased number of processors
(scalability).

In our algorithm the global task is the integration of the system of equations
(5) for $N$ particles, and the sub-tasks are the integration of sub-sets $S_{N_k}$ of
$N_k$ particles, with $k = 0, \dots, p$, where $P = p + 1$ is the number of available
processors. The parallel algorithm implements a single program multiple data
(SPMD) programming model: each sub-task is executed by the same program
operating on different data (the sub-sets of particles).

The diagram in figure 1 shows the structure of the parallel algorithm and the
main communication operations. The data are initially read from a file by one
processor and a broadcast communication operation is performed to share the
initial configuration of the system with every available processor. Each
processor $k$ is then assigned the integration of a sub-set of particles. The
global time step is also determined by a global communication operation, and
at the end of each time iteration the new configuration of the particles (in each
sub-set $S_{N_k}$) is shared between all processors.
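
The structure of one time iteration can be emulated serially to make the communication pattern explicit. The Python sketch below contains no message passing calls; the comments only indicate where the global operations would occur. The function name, the callback interface, and the choice of the minimum as the global time step are assumptions made for illustration, not a description of the NNEWTON implementation.

```python
def parallel_step(state, subsets, local_dt, advance):
    """Emulate one time step of the SPMD scheme.

    state    : list of particles, known to every 'processor'
    subsets  : one sequence of particle indices per processor
    local_dt : function(state, indices) -> time step proposed by a processor
    advance  : function(state, indices, dt) -> updated particles for indices
    """
    # Global reduction: every processor ends up with the same time step
    # (the minimum of the local proposals is an assumption of this sketch).
    dt = min(local_dt(state, idx) for idx in subsets)
    # Each processor integrates only its own sub-set of particles ...
    pieces = [advance(state, idx, dt) for idx in subsets]
    # ... and the pieces are gathered so that all processors share the
    # new configuration before the next iteration.
    new_state = [p for piece in pieces for p in piece]
    return new_state, dt
```

Because every processor needs the positions of all particles to evaluate the forces on its own sub-set, the gather at the end of the step is a collective operation involving all processors.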

The load-balance problem is completely avoided in this algorithm since each
processor is responsible for the same number of particles. The defined sub-sets
of particles are such that

$$\bigcup_{k=0}^{p} S_{N_k} = S_N, \qquad \sum_{k=0}^{p} N_k = N \tag{16}$$

and

$$N_i = N_j, \qquad i, j = 0, \dots, p.$$

4  Implementation  of  the  Parallel  Algorithm 

4.1  The  Message  Passing  Model 

The  implementation  of  the  P-PP  algorithm  was  done  in  the  framework  of  the 
message  passing  model  (see  [5]  and  [7]).  In  this  model  we  consider  a  set  of 
processes  (each  identified  with  a  unique  name)  that  have  only  local  memory  but 
are  able  to  communicate  with  other  processes  by  sending  and  receiving  messages. 

Most message passing systems implement an SPMD programming model:
each process executes the same program but operates on different data. However,
the message passing model does not preclude the dynamic creation of processes,
the  execution  of  multiple  processes  per  processor,  or  the  execution  of  different 
programs  by  different  processes. 
Fig. 1. The diagram shows the structure of the parallel algorithm and the main
communication operations: broadcasting the initial configuration of the system to all
processors, determination of the global time step and the global communication between
processors to share the new configuration of the system after one time step. Each
processor PE_k (k = 0, ..., p) is responsible for the integration of its sub-set S_{N_k} of particles.


For  our  work,  this  model  has  one  important  advantage:  it  fits  well  on  separate 
processors  connected  by  a  communication  network,  thus  allowing  the  use  of  a 
supercomputer  as  well  as  a  network  of  workstations. 


4.2  The  MPI  Library 

To implement the parallel algorithm, the Message Passing Interface (MPI)
library (see [5]-[12]) was chosen for the following reasons:

- source-code portability and efficient implementations across a range of
architectures are available,

- functionality and support for heterogeneous parallel architectures.

Using the MPI library it was possible to develop a parallel code that runs on a
parallel supercomputer like the Cray-T3D and on a cluster of workstations. On
the other hand, from the programming point of view it is very simple to implement
a message passing algorithm using the library functions.


4.3 Analysis of the MPI Implementation


The MPI implementation of the P-PP algorithm was possible with the use
of a small number of library functions. Two versions of the codes written in
FORTRAN 77, the MNEWTON codes (see [16]), were parallelized using the
following functions (see [y]):


Initialization 

1. MPI_INIT: Initializes the MPI execution environment.

2. MPI_COMM_SIZE: Determines the number of processors.

3. MPI_COMM_RANK: Determines the identifier of a processor.


Data Structures: Special data structures were defined containing the system
configuration.

4. MPI_TYPE_EXTENT: Returns the size of a datatype.

5. MPI_TYPE_STRUCT: Creates a structure datatype.

6. MPI_TYPE_COMMIT: Commits a new datatype to the system.

7. MPI_TYPE_FREE: Frees a no longer needed datatype.


Communication: One of the processors broadcasts the system configuration
to all other processors.

8. MPI_BCAST: Broadcasts a message from the processor with rank “root” to all
other processors of the group.


Global  Operations:  Used  to  compute  the  global  time  step,  and  to  share  the 
system  configuration  between  processors  after  one  iteration. 

9. MPI_ALLREDUCE: Combines values from all processors and distributes the
result back to all processors.

10. MPI_ALLGATHERV: Gathers data from all processors and delivers it to all.


Finalization 

11. MPI_FINALIZE: Terminates the MPI execution environment.
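The way these calls fit together can be sketched in a single process, without MPI. The helpers `bcast`, `allreduce_min` and `allgatherv` below are hypothetical pure-Python stand-ins for the collective operations, not MPI bindings; taking the minimum of the local time steps is an assumption (the text only says the global time step comes from a global operation), and the toy `local_dt` and `advance` callbacks are placeholders for the real integrator.

```python
def bcast(value, n_procs):
    """Stand-in for MPI_BCAST: every rank gets its own copy of the root's data."""
    return [list(value) for _ in range(n_procs)]


def allreduce_min(local_values):
    """Stand-in for MPI_ALLREDUCE with a minimum operation (assumed)."""
    result = min(local_values)
    return [result] * len(local_values)


def allgatherv(local_chunks):
    """Stand-in for MPI_ALLGATHERV: concatenate all chunks, deliver to all."""
    full = [x for chunk in local_chunks for x in chunk]
    return [list(full) for _ in local_chunks]


def run(config, n_procs, n_steps, local_dt, advance):
    size = len(config) // n_procs            # equal sub-sets, N_i = N_j
    copies = bcast(config, n_procs)          # share the initial configuration
    for _ in range(n_steps):
        subs = [copies[k][k * size:(k + 1) * size] for k in range(n_procs)]
        # Each rank proposes a time step; all ranks agree on the global one.
        dt = allreduce_min([local_dt(s) for s in subs])[0]
        # Each rank integrates only its own sub-set, seeing the full system.
        chunks = [advance(subs[k], copies[k], dt) for k in range(n_procs)]
        copies = allgatherv(chunks)          # share the new configuration
    return copies[0]


positions = [0.0, 1.0, 2.0, 3.0]
new_positions = run(
    positions, n_procs=2, n_steps=1,
    local_dt=lambda sub: 1.0 / (1.0 + max(sub)),            # toy criterion
    advance=lambda sub, everyone, dt: [x + dt for x in sub],  # toy update
)
```

In a real MPI program each rank would execute the loop body once per iteration and the three helpers would be replaced by the corresponding library calls.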
5 Performance Analysis

To analyse the performance of a parallel program several metrics can be
considered depending on what characteristic we want to evaluate. In this work we
are interested in studying the scalability of the P-PP algorithm, that is, how
effectively it can use an increased number of processors. The metrics we used to
evaluate the performance are functions of the program execution time (T), the
problem size (N, number of particles), and processor count (P). In this section
we will define the metrics (as in [5] and [14]) and discuss their application.


5.1  Metrics  of  Performance 


We will consider three metrics for performance evaluation: execution time,
relative efficiency, and relative speedup.

Definition 1. The execution time of a parallel program is the time that elapses
from when the first processor starts executing the program to when the last
completes execution.

On each processor, the execution time is the sum of
three distinct times: computation time (during which the processor is performing
calculations), communication time (time spent sending and receiving messages),
and idle time (the processor is idle due to lack of computation or lack of data).

In this study the program is allowed to run for 10 iterations and the execution
time is measured by the time of one iteration (T_one = T_ten/10).


Definition 2. The relative efficiency (E_r) is the ratio between the time T_1 of
execution on one processor and P times the time T_P of execution on P processors,

$$E_r = \frac{T_1}{P \, T_P}. \qquad (17)$$


The relative efficiency represents the fraction of time that processors spend
doing useful work. The time each processor spends communicating with other
processors or just waiting for data or tasks (idle time) will make efficiency always
less than 100% (this may not be true in some cases where we have a superlinear
regime due to cache effects but we will not discuss it in this work).


Definition 3. The relative speedup (S_r) is defined as the ratio between the time T_1
of execution on one processor and the time T_P of execution on P processors,

$$S_r = \frac{T_1}{T_P}. \qquad (18)$$

The  relative  speedup  is  the  factor  by  which  execution  time  is  reduced  on  P 
processors.  Ideally,  a  parallel  program  running  on  P  processors  would  be  P  times 
faster than on one processor and we would get S_r = P. However, communication
time and idle time on each processor will make S_r always smaller than P (except
in the superlinear regime).

These quantities are very useful to analyse the scalability of a parallel
program; however, efficiency and speedup as defined above do not constitute an
absolute figure of merit since the time of execution on a single processor is used as
the baseline.
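As a small worked example of equations (17) and (18), the sketch below evaluates both metrics from measured times. The timings are invented for illustration only and are not results from this work; the function names are hypothetical.

```python
def time_per_iteration(t_ten):
    """Execution time of one iteration, T_one = T_ten / 10."""
    return t_ten / 10.0


def relative_speedup(t1, tp):
    """Equation (18): S_r = T_1 / T_P."""
    return t1 / tp


def relative_efficiency(t1, tp, n_procs):
    """Equation (17): E_r = T_1 / (P * T_P)."""
    return t1 / (n_procs * tp)


# Hypothetical timings in seconds for ten iterations on 1 and on 64 processors.
t1 = time_per_iteration(640.0)    # 64.0 s per iteration on one processor
t64 = time_per_iteration(12.5)    # 1.25 s per iteration on 64 processors

s_r = relative_speedup(t1, t64)             # 51.2, below the ideal value 64
e_r = relative_efficiency(t1, t64, 64)      # 0.8, i.e. 80%
```

Note that E_r = S_r / P, so the two metrics carry the same information once P is known.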


5.2  Performance  Results  of  the  PNNEWTON  Code 

For the performance analysis of the algorithm we measured the time of one
iteration for a range of values of two parameters: problem size and number of
processors. The relative efficiency and relative speedup were then evaluated
using equations (17) and (18).

The objectives of this analysis are two-fold. First, we want to investigate
how the metrics vary with increasing number of processors for a fixed problem
size. Second, we want to investigate the behavior of the algorithm for different
problem sizes within the range of interest for our N-body simulations. For that
purpose the parallel code (PNNEWTON) was tested on the Cray-T3D system
of the Edinburgh Parallel Computing Centre (EPCC). The system consists of 512
processors arranged on a tridimensional torus and running at
150 MHz. The peak performance of the T3D array itself is 76.8 Gflop/s (see [4]).

The next figures show the results of the tests for systems with N = 2^6, ..., 2^14.
The code was integrating equations (5). Similar tests were performed for another
version of the PNNEWTON code which integrates equations (5) and (10) and
identical results were obtained.


6  Conclusions 


[...] development of a parallel code suitable to
study N-body systems with N ~ 10 - 10^[...]. The required features of the program
[...] performed on both versions
(PNNEWTON 1.0 and 2.0) showed an almost linear speedup and a relative
efficiency between 60% and 98%. The worst cases (E_r ≈ 60% and E_r ≈ 65%)
correspond to a system with 64 particles running on 64 processors, and to a
system with 128 particles running on 128 processors. With those configurations
the communication costs are comparable to the computational costs and the
efficiency drops.

[...] for the parallelization of
the PP algorithm it is possible to write a portable code with high efficiency and
good scalability. Our parallel algorithm appears to be appropriate to develop
parallel versions of the PP method.
[Figure: Time of one Iteration - PNNEWTON (v1.0) / CRAY-T3D (EPCC)]


Fig. 2. For each value of N = 2^k (k = 6, ..., 14) the system is allowed to evolve during
ten time steps. The computation was performed on a different number of processors.
The variation of the time of one iteration with the number of processors for the tested
systems shows a good scaling.


[Figure: Relative Speedup - PNNEWTON (v1.0) / CRAY-T3D (EPCC); horizontal axis: number of processors]


Fig. 3. The program is showing a good scalability for the tested configurations. The
speedup is almost linear.
Fig. 4. The program shows high efficiency for most of the configurations tested. The
lowest  efficiencies  correspond  to  cases  where  the  cost  of  communications  is  relevant 
(the  number  of  particles  is  the  same  as  the  number  of  processors). 
Abstract. The Genoa Active Message Machine (GAMMA) is a high-performance
Active Messages-like communication layer implemented at
kernel level as an extension of the Linux Operating System, and made
available to user applications through a programming library. On low-cost
clusters of Personal Computers (PCs) connected by Fast Ethernet,
GAMMA achieves much better communication performance compared
to public domain implementations of MPI and PVM.

We have considered an existing PVM Molecular Dynamics (MD) parallel
application, designed to be portable across various MPP as well as NOW
platforms. The goal of our work is to show how convenient migrating such a
complex application from PVM to GAMMA is in terms of
absolute performance improvement as well as price/performance ratio in
the perspective of running MD on a low-cost cluster of PCs. The “migration”
approach is then compared to two other alternatives, namely:
running the PVM version of MD “as is” on a cluster of PCs and trying
to tune the PVM version of MD to match the underlying cluster architecture.
It is shown that neither of these two alternatives leads to satisfactory
performance.


Keywords:  Fast  Ethernet;  Molecular  Dynamics;  Network  of  workstations;  Parallel 
processing;  Personal  computers. 


1  Introduction 

Molecular  Dynamics  (MD)  is  one  of  the  most  frequent  parallel  applications  in 
the  scientific  community.  MD  typically  exhibits  fairly  good  speed-up  figures  on  a 
wide  range  of  parallel  computers  with  good  intrinsic  load  balancing.  This  offers 
the  opportunity  to  investigate  the  behaviour  of  large  size  samples  of  material  by 
numerical  simulation. 

Networks Of Workstations (NOWs) have emerged as the first cost-effective
parallel architecture. Clusters of high-end Personal Computers (PCs) are emerging
as an even better solution, with unbeaten price/performance ratio and potentially
good absolute performance levels.

A serious obstacle to running MD on a cluster of PCs is the high communication
latency exhibited by standard parallel programming environments like
PVM [6] and MPI [7] running atop industry-standard communication protocols
like TCP and UDP. Recently several teams have been engaged in producing
efficient solutions using faster networks and optimized communication software
to keep latency as low as possible. Many such attempts gave rise to non-standard
programming interfaces for high-performance communication. Porting
a non-trivial parallel application to a non-standard communication layer may be
an expensive task. However, a better price/performance ratio and a satisfactory
absolute performance level on a cluster of PCs may justify the porting effort.

In this paper we discuss three experiences of porting an existing MD parallel
application to a low-cost cluster of PCs. The original MD code is a FORTRAN
program with calls to PVM communication routines. The low-cost cluster is a
pool of sixteen Pentium 133 MHz PCs, each equipped with 32 MByte of RAM and
256 KByte of second-level cache, networked by a shared 100base-TX Ethernet
LAN. Each PC runs Linux, a POSIX-compliant Unix operating system.

The first experience [3] consists of migrating MD from PVM to the Genoa
Active Message MAchine (GAMMA) [1, 2], an efficient communication system
based on Active Messages [8] and designed for best efficiency on 100base-T
clusters of PCs. Porting MD to GAMMA required replacing PVM calls with
calls to communication routines from the GAMMA library, as well as changing
some communication patterns in order to achieve better exploitation of the
capabilities of the underlying network hardware fully exposed by GAMMA. Therefore
the corresponding porting effort was not negligible. The resulting MD application
shall be called MD-GAMMA hereafter.

The  second  porting  experience  (also  described  in  [3])  consists  of  running  the 
original  PVM  version  of  MD  “as  is”  on  our  cluster.  This  corresponds  to  a  zero 
porting  effort. 

The third porting experience consists of trying to tune the communication
patterns of the original PVM version of MD in order to increase the match with
the network architecture of our cluster. This implies a very limited porting effort.
The resulting application shall be called MD-TOKEN hereafter, as a circulating
token has been added to reduce network contention.


2  The  Genoa  Active  Message  MAchine  (GAMMA) 

The Genoa Active Message MAchine (GAMMA) [1, 2] is an efficient messaging
system based on Active Messages [8]. GAMMA is mainly implemented as a custom
network device driver plus a small number of additional system calls extending
the Linux kernel. Currently only the 3COM 3c595 and 3c905 Fast Ethernet
adapters are supported. The GAMMA programming interface is a small yet
complete set of communication functions supporting SPMD as well as MIMD
programming styles, and made available to user applications through a programming
library.

The  efficiency  of  GAMMA  is  mainly  due  to  three  features,  namely: 

- A “zero copy” communication protocol, that is, no temporary buffers for
messages along the whole communication path, thanks to the adoption of the
Active Messages communication paradigm. This enables low-latency communication.

- A pipelined communication path, that is, the various stages of the communication
path work in parallel for best communication throughput. Every
messaging system works in a pipelined way when delivering large messages
fragmented into smaller units, but GAMMA allows a pipelined path even with
small, unfragmented messages. This allows best throughput for small as well
as large messages.

-  Broadcast  primitives  which  directly  expose  the  Ethernet  hardware  broadcast 
features  to  the  applications.  This  allows  efficient  broadcast  communication. 

With GAMMA, every process of a given parallel application owns, and may
activate and use, 255 communication ports through which it can send and
receive messages. Usable communication ports are numbered in the range from
zero to 254. Port number 255 is currently reserved for the implementation of the
barrier synchronization. Prior to using any of its own ports, the process may
bind it to:

- A port of a destination process, for messages that will be sent through
the port.

-  A  destination  buffer  in  user  space  for  storing  incoming  messages. 

-  A  program-defined  function  acting  as  receiver  handler  for  the  port. 

A GAMMA receiver handler is an application-defined function which will be
run at each message arrival. Such a function will “consume” the message itself
and possibly prepare a fresh final destination for the next incoming message.
For instance, in order to prevent a subsequent incoming message from overwriting
the previous one in the same user-space destination buffer, the receiver
handler may re-bind the port to a fresh destination for the next incoming
message.

- A program-defined function acting as error handler for the port. A GAMMA
error handler is like a receiver handler, but it is invoked in case of communication
errors rather than upon successful message receptions. The purpose of
error handlers is to help build application-level error recovery policies.

After  a  port  is  bound,  its  number  fully  defines  the  destination  of  messages 
sent  through  the  port,  as  well  as  the  user-space  final  destination  of  messages 
incoming  through  the  port  and  the  actions  performed  by  the  process  in  order  to 
consume  them. 

With GAMMA the programmer is forced to bind a port for input before
receiving messages from that port. This implies that the kernel is notified of the
address of the destination user-space buffer in advance of the message arrival.
Therefore the activity of storing incoming messages into their final destinations
can be performed directly by the GAMMA device driver rather than by the user-defined
receiver handlers, and does not require any temporary kernel buffer.

2.1  Synchronous  receive  in  GAMMA 

With Active Messages there is a “send” but no “receive” operation. Instead
the receiver handlers act as independent threads of the application triggered
by message arrivals to perform the receive activities. Additional programming
effort must be spent to ensure that receiver threads correctly cooperate with the
main process thread. A very frequent problem is when the main thread needs to
synchronize with a message arrival before continuing computation (e.g. when the
process needs to receive data before processing them). A general solution is to
use application-defined synchronization flags as follows:

1.  A  flag  F  of  the  application  is  initially  reset. 

2. In order to wait for one incoming message from a port P, the receiver process
starts busy-waiting in a loop until F is set.

3.  The  receiver  handler  bound  to  port  P  sets  F  upon  message  arrival. 

GAMMA offers a more flexible and reliable solution in the form of two
semaphore-oriented library functions, namely gamma_wait() and
gamma_signal(). Such functions give safe access to per-port semaphores embedded
into the GAMMA library. The example above becomes as follows:

1. In order to wait for one incoming message from port P, the receiver process
issues gamma_wait(P, 1).

2.  The  receiver  handler  bound  to  port  P  issues  gamma_signal(P)  upon  message 
arrival. 
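The wait/signal pattern above can be mimicked with ordinary counting semaphores. The sketch below is not the GAMMA library: the class `PortSemaphores` and its methods are hypothetical stand-ins built on Python threads, and reading the second argument of gamma_wait(P, 1) as a message count is an assumption based on the phrase "wait for one incoming message".

```python
import threading


class PortSemaphores:
    """Toy per-port counting semaphores (illustrative, not GAMMA's API)."""

    def __init__(self, n_ports=255):
        self._sems = [threading.Semaphore(0) for _ in range(n_ports)]

    def signal(self, port):
        # Called by the receiver handler after a message has been consumed.
        self._sems[port].release()

    def wait(self, port, count=1):
        # Blocks the main thread until `count` messages have been signalled.
        for _ in range(count):
            self._sems[port].acquire()


ports = PortSemaphores()
inbox = []


def receiver_handler(port, message):
    inbox.append(message)      # "consume" the incoming message
    ports.signal(port)         # then wake up whoever waits on this port


# A separate thread plays the role of a message arriving on port 3.
arrival = threading.Thread(target=receiver_handler, args=(3, "coordinates"))
arrival.start()
ports.wait(3, 1)               # main thread blocks without busy-waiting
arrival.join()
```

Compared with the flag-based version, the waiting thread sleeps instead of spinning, and the semaphore count keeps track of arrivals even if several messages come in before the wait is issued.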


2.2  Communication  performance 

On our low-cost cluster of PCs, GAMMA achieves one-way “ping-pong” user-to-user
message latency as low as 13 µs, with asymptotic bandwidth as high
as 12.2 MByte/s (98% of the maximum 100base-T Ethernet throughput). Half
the asymptotic bandwidth is achieved with messages as short as 200 bytes. Such
performance numbers are measured at application level, that is, they represent
the communication performance effectively delivered to user applications.
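These figures can be cross-checked with a back-of-the-envelope calculation. The sketch assumes the common linear cost model t(n) = latency + n / bandwidth, which is not stated in the text, and takes 1 MByte as 10^6 bytes; under that model the message size giving half the asymptotic bandwidth is latency times bandwidth.

```python
LATENCY_S = 13e-6          # one-way latency reported above (13 microseconds)
BANDWIDTH_BPS = 12.2e6     # asymptotic bandwidth in bytes/s (MByte = 10**6 bytes)
LINK_BPS = 100e6 / 8       # raw Fast Ethernet rate: 100 Mbit/s = 12.5e6 bytes/s


def transfer_time(n_bytes):
    """Assumed linear model: fixed start-up cost plus size over bandwidth."""
    return LATENCY_S + n_bytes / BANDWIDTH_BPS


def effective_bandwidth(n_bytes):
    """Bandwidth actually seen by a message of n_bytes under the model."""
    return n_bytes / transfer_time(n_bytes)


fraction_of_link = BANDWIDTH_BPS / LINK_BPS     # about 0.976, i.e. roughly 98%
half_power_size = LATENCY_S * BANDWIDTH_BPS     # about 158.6 bytes
```

The model predicts a half-bandwidth message size of roughly 160 bytes, the same order as the measured 200 bytes; the difference suggests that the per-message cost is not perfectly constant, which the simple model ignores.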

In  terms  of  latency  GAMMA  rivals  many  much  more  expensive  massively 
parallel  platforms.  Obviously  GAMMA  cannot  compete  with  such  platforms  in 
terms  of  bandwidth  as  well  as  scalability.  On  the  other  hand  no  massively  parallel 
computer  can  compete  with  GAMMA  in  terms  of  price/performance  ratio. 

3  The  Molecular  Dynamics  application 

Our MD application [4, 5] is a typical Molecular Dynamics code used for simulating
the behaviour of polarizable fluids. The current release of MD is written in
FORTRAN with calls to PVM routines, and is structured as a MIMD application.
The simulation of material samples with a larger number of molecules turns the
behaviour of MD from communication intensive to computation intensive. In our
investigations the number of molecules has been kept as low as 4000 to stress the
communication side.

MD performs a standard Lennard-Jones calculation plus the solution of the
induced polarizability on each molecule, taking into account the first dipole moment.
Each step of MD consists of evaluating the induced dipoles p_i consistent with
the values of e\^'^ due to a given distribution of the point charges. This part of
the calculation requires an iterative procedure with small computation time and
many communications to exchange the values of the induced polarizability at each
iteration among all processors. For a small number of molecules the cutoff radius
is of the same size as the replicated box and the number of force vectors between
molecule pairs grows almost quadratically with the total number of molecules.
In such a situation no domain decomposition technique based on the spatial
position of each molecule in the box is feasible.

In the parallel implementation each processor maintains a copy of the position
of each molecule. However each processor will compute force pairs only on a
predefined subset of molecules which has been previously assigned to it. In this
way the list of interacting particles, which is by far the largest data structure of
MD, can be partitioned among the computation nodes and the total memory
occupancy per processor is expected to decrease with increasing number of
computation nodes.
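A minimal sketch of this replicated-data scheme is given below. It assumes that every pair of molecules interacts (the small-system case in which the cutoff spans the whole box) and that molecules are assigned to processors round-robin; the actual assignment used by MD is not described in the text, and `pair_lists` is a hypothetical name.

```python
from itertools import combinations


def pair_lists(n_molecules, n_procs):
    """Partition the list of interacting pairs (i, j), i < j, among processors.

    Every processor is assumed to hold all positions; pair (i, j) is stored
    only on the owner of molecule i, here chosen as i modulo n_procs.
    """
    lists = [[] for _ in range(n_procs)]
    for i, j in combinations(range(n_molecules), 2):
        lists[i % n_procs].append((i, j))
    return lists


serial = pair_lists(8, 1)      # one processor stores all 28 pairs
parallel = pair_lists(8, 4)    # four processors store 10, 8, 6 and 4 pairs
largest = max(len(chunk) for chunk in parallel)
```

The largest per-processor list shrinks as processors are added, which is the expected decrease in memory occupancy. The uneven list lengths show that this naive assignment does not balance the work evenly.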

When using high-latency communication systems like PVM, an important
optimization is to keep the number of distinct messages as low as possible in order
not to pay too much for the communication start-up costs. This is achieved by
packing all the variables to be communicated (i.e. forces, virial, energy) in a single
outgoing message whenever possible. Keeping the number of distinct messages as
small as possible reduces the possibility of using multicast/broadcast communication
primitives, since in PVM such collective communications are implemented
as bare repetitions of point-to-point communications. Almost all communications
were point-to-point ones, except for a few of them, i.e. the exchange of the new
coordinates of the molecules.

4  Migrating  the  application  from  PVM  to  GAMMA 

In order to migrate MD from PVM to GAMMA to obtain the MD-GAMMA application,
the GAMMA programming library has been extended with FORTRAN
stubs to the original GAMMA communication C functions in a straightforward
way.

Our PC cluster is equipped with low-cost shared 100base-T Ethernet hardware.
This implies that the communication patterns of MD may cause lots of
Ethernet collisions, with heavy communication delays. This could be partially
avoided if the Fast Ethernet hub were replaced by a switch, but at a higher price.
The alternative is to explicitly program a proper serialization of network accesses
at the application level and to take best advantage of the Ethernet’s hardware
broadcast facility that the GAMMA programming interface directly exposes.
The serialization of communications during collective all-to-all data exchanges
has been obtained in MD-GAMMA by considering all processes as circularly
ordered by instance number and implicitly granting broadcast transmission right
to a process after it has received broadcast messages from all its predecessors.
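The implicit granting of the transmission right can be simulated in a few lines. The sketch assumes a single exchange round that starts at instance 0, so that the predecessors of process i are the processes 0, ..., i-1; the function name `all_to_all_broadcast` is hypothetical and no real network is involved.

```python
def all_to_all_broadcast(n_procs):
    """Simulate one collision-free all-to-all exchange on a shared medium."""
    received = [0] * n_procs     # broadcasts seen so far by each process
    sent = [False] * n_procs
    order = []                   # order in which processes use the medium
    while not all(sent):
        # A process may transmit once it has heard all of its predecessors.
        ready = [i for i in range(n_procs)
                 if not sent[i] and received[i] == i]
        assert len(ready) == 1   # exactly one sender at a time: no collisions
        sender = ready[0]
        sent[sender] = True
        order.append(sender)
        for j in range(n_procs):  # hardware broadcast reaches all other nodes
            if j != sender:
                received[j] += 1
    return order


order = all_to_all_broadcast(4)   # processes transmit in the order 0, 1, 2, 3
```

No explicit token message is needed: counting the broadcasts already received tells each process when its turn has come.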

Another source of performance degradation with MD is the need for application-level
temporary storage for incoming messages. Even with a “zero-copy” messaging
system like GAMMA, MD-GAMMA must implement temporary storage
for received messages, because some broadcast messages carry information to be
scattered among many processors and summed component-wise to existing local
information arranged as arrays.

A potential problem with GAMMA is that the receiver is forced to accept
messages in their final destination at any time the sender starts a communication.
This may cause race conditions in the memory of the receiver process during the
all-to-all exchange phase of MD. Such an all-to-all exchange is a two-step operation
structured as two communication phases interleaved by one computation
phase. In the computation phase the fresh data from the first communication
phase are manipulated, i.e. summed to previous data. If data from the second
communication were delivered in the same data structure as data from the first
communication, an inconsistency would arise if the second communication occurred
before the intermediate computation step was complete. To avoid such race conditions
in MD-GAMMA we had to implement FIFO queues of application receive
buffers for storing incoming GAMMA messages. Computations are carried out
directly on the FIFOs’ head arrays, whereas fresh incoming data are stored in
the FIFOs’ tail arrays. This way data from the second communication phase do
not overwrite data from the first phase which have not yet been processed.
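The FIFO arrangement can be illustrated as follows. The class `ReceiveFifo` is a hypothetical single-threaded sketch, not the MD-GAMMA code: `on_message` stands for what a receiver handler would do at the tail, while the computation works on the head.

```python
from collections import deque


class ReceiveFifo:
    """Queue of application receive buffers: write at the tail, read at the head."""

    def __init__(self):
        self._buffers = deque()

    def on_message(self, data):
        # Receiver-handler side: fresh data always go into a new tail buffer.
        self._buffers.append(list(data))

    def head(self):
        # Computation side: oldest buffer that has not been processed yet.
        return self._buffers[0]

    def release_head(self):
        self._buffers.popleft()


local = [1.0, 2.0, 3.0]
fifo = ReceiveFifo()
fifo.on_message([10.0, 20.0, 30.0])   # data from the first communication phase
fifo.on_message([0.5, 0.5, 0.5])      # second phase arrives before the sum is done

# Intermediate computation: sum the first-phase data into the local arrays.
local = [a + b for a, b in zip(local, fifo.head())]
fifo.release_head()
after_first_phase = list(local)

# Only now are the second-phase data consumed.
local = [a + b for a, b in zip(local, fifo.head())]
fifo.release_head()
```

Because the early second-phase message lands in its own buffer, the first-phase data are still intact when the intermediate computation reads them.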

Migrating  MD  from  PVM  to  GAMMA  required  one  week  of  work  from  the 
first  author  of  this  paper  to  replace  PVM  calls  with  GAMMA  calls,  change  some 
communication  patterns  and  implement  Active  Messages-like  receive  policies, 
plus  an  additional  week  of  work  from  the  second  author  to  debug  and  run  the 
obtained  MD-GAMMA  application. 


5  Tuning  the  existing  PVM  application 

Another possibility for porting an existing PVM application to a given target
platform is to retain the original message passing interface and to tune the
communication patterns of the application in order to increase performance by
matching the target architecture.

In the case of MD, an obvious drawback of the original version when running
on a bus-interconnected pool of processing nodes like a PC cluster with shared
Fast Ethernet is bus contention, which may cause unacceptably large communication
delays due to collision storms. The easiest way to overcome such a problem is
to serialize processes when accessing the network by adding a circulating token
implemented by ordered exchanges of null PVM messages.
In our preliminary study we added a circulating token only in one subroutine
of MD, which turns out to be heavily used in the program run. The resulting
MD-TOKEN application required a very limited working effort. The token overhead
is negligible compared to the overall communication overhead as well as the MD
computation time.

6  Performance  results 

Let us consider the speed-up curves depicted in Figure 1. The slow-down
exhibited by MD as the number of processors increases beyond eight is clearly
apparent. Given the low computational power of Pentium 133 MHz CPUs, such
behaviour is accounted for by the poor efficiency of the PVM messaging system,
which involves many temporary copies of messages during the traversal of many layers
of communication protocols, as well as by the collision storms arising from processes
simultaneously accessing the shared LAN during the exchange phases of
the program execution.

By contrast, the excellent speed-up curve of MD-GAMMA up to 16 nodes, with
the  promise  of  a  good  scaling  over  even  more  processors,  is  mainly  due  to  the 
following  reasons: 

-  the  relatively  poor  floating-point  computational  power  of  Pentium  133  MHz 
CPUs 

-  the  high  efficiency  of  GAMMA  inter-process  communications 

-  the  fine  tuning  of  the  communication  patterns  in  the  GAMMA  version  of  the 
application,  based  on  the  knowledge  of  features  (broadcast)  and  limitations 
(shared  LAN)  of  the  underlying  communication  hardware. 

In spite of its lower collision rate, MD-TOKEN shows a speed-up curve which
is even worse than that of MD. The reason is that serializing network accesses by a
circulating token implies serializing the software overhead of communications as
well. When communication overhead is high, as with ordinary PVM, the potential
advantage of eliminating collisions is far outweighed by the loss of parallelism in
the execution of low-level communication software. Thus, coordinating processes
at application level in the hope of making better use of the network may be
counterproductive with high-latency messaging systems. It is worth noting that
the overhead of the circulating token itself is negligible (less than 5% with 16
nodes).

Figure 2 reports the average completion time per time-step for MD as well as
MD-GAMMA and MD-TOKEN on our PC cluster. The curve of average completion time
per time-step of MD on an IBM SP2 with eight “thin nodes” is reported too.
MD-GAMMA appears to outperform the IBM SP2 if more than twelve processors are
engaged in the computation, besides performing better than the other two MD
versions. When reading such curves it is important to pay attention to both the
absolute performance and the cost of the hardware platform. It is worth pointing
out that the current market cost of a 16-node GAMMA cluster based on shared
100base-T Ethernet and Pentium 133 MHz CPUs is comparable to the cost of one
single high-end workstation.

[Figure 1: speed-up versus number of nodes. Curves: optimal speed-up, MD-GAMMA,
MD, MD-TOKEN.]

Fig. 1. Molecular Dynamics, GAMMA vs. PVM: speed-up comparison with same
hardware platform (shared 100base-T Ethernet network of Pentium 133 PCs).


7  Conclusions 

By using a low-latency messaging system like GAMMA, a significant number of
networked PCs may be successfully exploited to run parallel code even with a
low-cost interconnect like shared 100base-T Ethernet. Indeed, low latency as
well as native broadcast communications offer more flexibility at the
programming level to implement collision-free collective communication patterns.
Similar collision-free patterns are not feasible with high-latency messaging
systems like PVM, which provide a poor implementation of broadcast and too high
a communication overhead. These overheads are not expected to decrease at the
same rate at which the peak communication bandwidth offered by Ethernet
technology is increasing (not to mention the additional loss of efficiency when
moving to SMP processing nodes).

In the case of MD it is apparent that exploiting a low-latency messaging system
like GAMMA is the only way to turn a low-cost cluster of PCs into a
cost-effective solution for parallel processing. The same holds for the large
class of “non-embarrassingly parallel” well-balanced parallel applications. The
gain in price/performance as well as the good absolute performance level
obtained on this kind of inexpensive platform makes the porting effort
worthwhile, at least in the case of well documented applications.


Abstract. Increasing the instruction level parallelism (ILP) is one of the key
issues in boosting the performance of future generation processors. Current
processor organizations include different mechanisms to overcome the
limitations imposed by name and control dependences, but no mechanisms
targeting data dependences. Thus, these dependences will become one of the main
bottlenecks in the future. Data value speculation is gaining popularity as a
mechanism to overcome the limitations imposed by data dependences by predicting
the values that flow through them. In this work, we present a study of the
potential of data value speculation to boost the limits of instruction level
parallelism using both perfect and realistic predictors. Speedups obtained by
data value speculation are very large for an infinite window and still
significant for a limited window. Different prediction schemes oriented to
single-thread and multiple-thread (from a single program) architectures have
been studied. The latter show a significant improvement with respect to the
former for FP benchmarks, although the difference is much smaller for integer
programs.


1  Introduction 

The performance of superscalar processors is limited by the necessity to obey
the dependences existing among the program instructions. These dependences can
be classified into three types [5]: name dependences, control dependences and
data dependences.

Name dependences appear when the values generated by two instructions are to be
written in the same storage location, either a register or memory. They can be
eliminated by renaming the storage location that causes the dependence (i.e.
changing the name of the locations where the values are to be written). Register
renaming is a well known technique that deals with this kind of dependence. It
is implemented dynamically by many current microprocessors such as the DEC Alpha
21264 [4] or the MIPS R10000 [23].

Control dependences are caused by branch instructions. They slow down the
processor since it has to stall the fetch of instructions until the branch is
resolved, i.e. the destination address is computed and the condition is
evaluated. Branch prediction is the mechanism that current microprocessors
implement in order to overcome control dependences. It is based on predicting
the outcome of branches, which allows instructions that depend on a branch to be
executed before the result of that branch is known.

Data dependences or true dependences appear when an instruction consumes the
value produced by a previous instruction. These dependences are enforced in
current microprocessors by executing the consumer after the producer. Thus, data
dependences limit the amount of instruction level parallelism (ILP) by imposing
a serialization on the execution of some instructions.

In the same way as control dependences are managed by predicting the behavior of
branches, it may be feasible to predict the result of some instructions in order
to avoid the ordering imposed by data dependences, allowing the consumer
instruction to be issued before the execution of the producer. The term data
value speculation is used to refer to those mechanisms that predict the operands
of an instruction, either source or destination, and speculatively execute the
instructions that depend on it before the actual value is computed.

In this work, we present a study of the ILP improvement that data value
speculation techniques can provide. We present an evaluation of the limits of
ILP that can be exploited by dynamically scheduled processors with infinite
resources and data value speculation, and compare it with that of the same
processor without data value speculation. We evaluate the benefits of predicting
individual types of instructions (loads, stores, simple arithmetic, and
multiplications) and the improvement achieved by predicting all of them. We
consider both ideal prediction schemes and realistic ones. Finally, the impact
of data value speculation for a limited instruction window is also evaluated.
The results show that data value speculation can significantly increase the ILP
that dynamically scheduled processors can exploit, and therefore it is a
promising technique to be considered for future generation microprocessors.

The rest of this paper is organized as follows. Section 2 reviews the related
work. The methodology to evaluate the ILP that can be exploited by an ideal
processor, either with or without data value speculation, is described in
section 3. The value predictors considered in this work are presented in
section 4. The results of this study are detailed in section 5. Finally, section
6 summarizes the main conclusions of this work.


2  Related  work 


There has been a plethora of works dealing with the limits of ILP
[1][2][6][10][16][20][21]. Each work studies the ILP that could be exploited
under some constraints such as fetch width, instruction window size, branch
prediction, register renaming, memory aliasing, etc. A conclusion that can be
extracted from all these works is that data dependences are one of the main
features that limit parallelism. For instance, in [5] it is shown that the
maximum ILP that a processor could achieve with infinite resources and perfect
branch prediction is not much higher than a few hundred instructions per cycle
(IPC), and for some applications it is about a few tens of IPC.

Data value speculation has been the focus of several recent works. It is
performed in [14] by predicting the address of load instructions, whereas in [9]
the address of stores is also predicted. In both cases the prediction is carried
out using a history table of memory instructions and a stride-based predictor.
In [12], data value speculation is based on predicting the value that load
instructions read from memory. The proposed mechanism exploits the feature that
the authors call value locality, which refers to the fact that many load
instructions repeatedly bring the same value from memory. Value locality is
extended to all types of instructions in [11]. In [8] data value speculation is
performed by predicting the value read by load instructions. Unlike the
mechanism proposed in [12], the load values are predicted by predicting their
effective address and prefetching the data from memory into the history table.
In [15] Sazeides and Smith show that the results that an instruction generates
may follow a repetitive pattern that stride predictors cannot predict, and
propose a context-based predictor. In [22] Wang and Franklin present a hybrid
predictor. The implementation of this predictor is similar to that of a 2-level
branch predictor. In [7] the impact of different value predictors on the
performance of a processor is studied using a limited instruction window.

The main contributions of this work are the following. This is the first work,
to our knowledge, that evaluates the limits of ILP in an ideal dynamically
scheduled superscalar processor that exploits data value speculation and
compares it with that of the same processor without data value speculation. In
[11], value prediction is evaluated for a perfect machine, as it is called by
the authors. However, that machine is limited by a finite instruction window
(4096 entries), branch prediction and fetch bandwidth. In addition, in this
paper we study the benefits of predicting individual types of instructions for
both ideal and realistic predictors.

3  Methodology 

This section describes the methodology that we have used to obtain the ILP under
different scenarios regarding prediction schemes and hardware resources.

3.1  Experimental  framework 

The evaluation methodology is trace-driven. The trace of each program has been
generated using the ATOM tool [19]. For each instruction, the instrumentation
routine obtains its operation code, the source and target registers, the
effective address (if the instruction is either a load or a store), and the
value generated in the case of arithmetic and load instructions. These data are
fed into the analysis program, which computes the performance achieved by the
particular architectural model. Performance is reported as instructions per
cycle (IPC).

The whole SPEC95 benchmark suite has been used for the different experiments.
All the benchmarks have been compiled for a DEC AlphaStation 600 5/266 with the
‘-O4’ optimization flag, and executed with their largest input set. Each program
has been run for 5 billion instructions, except gcc and ijpeg, which have been
run until completion (1,569,885,184 and 684,497,921 instructions respectively).
Figure 1 details the percentage of different types of instructions executed for
the whole SPEC95 benchmark suite.

3.2  Architectural  model 

The first study of the limits of ILP is carried out assuming an ideal
microprocessor with infinite resources, perfect branch prediction, infinite
instruction fetch bandwidth, an infinite cache memory with an infinite number of
ports, perfect memory disambiguation, dynamic renaming with an infinite number
of registers, and memory renaming with infinite storage locations for renaming.
Both an infinite and a limited instruction window are considered. In all the
cases, precise exceptions [17] and an infinite retirement (commit) bandwidth are
assumed.

Figure 1. Dynamic percentage of each type of instructions.

3.3  IPC  computation  for  an  ideal  architecture  without  data  value  speculation 

The IPC of a given program for a particular architectural model is obtained by
determining the time (measured in number of cycles) when the latest result of
any instruction of the program is computed, and then dividing the number of
executed instructions by that number of cycles.

We will refer to the cycle when the result of an instruction i is available as
the completion time of i, or $CT_i$ for short. $CT_i$ is computed as the maximum
$CT_j$ for any j such that j produces a result that is a source operand of i,
plus the latency of the operation i. This approach is similar to the one used in
[1].

Each instruction of the trace produced by the execution of the instrumented
program is analyzed in order to know the time when its operands are available.
For each storage location the analysis program keeps track of the CT of the last
instruction that wrote to it. This is implemented by means of two tables that
are called the register write table (RWT) and the memory write table (MWT).
$RWT_r$ stores the CT of the last instruction so far whose destination operand
was the logical register r. $MWT_a$ stores the CT of the last store that wrote
into address a.

Therefore, when an arithmetic instruction is processed, the RWT is accessed in
order to obtain the cycle in which the source operands are available. Then, its
CT is computed and the entry associated with its destination register is updated
with the newly computed CT. That is:

$$RWT_{dest} = \max(RWT_{src1}, RWT_{src2}) + Latency_{operation} \tag{1}$$

In a similar way, when a load from address a is processed, the MWT is accessed
to obtain the cycle in which a previous store wrote into that memory position.
Then, the RWT is updated as follows:

$$RWT_{dest} = \max(RWT_{src1}, RWT_{src2}, MWT_a) + Latency_{load} \tag{2}$$

Finally, when a store to address a is processed, the MWT is updated to reflect
the new write to this memory location:

$$MWT_a = \max(RWT_{src1}, RWT_{src2}) + Latency_{store} \tag{3}$$

Notice that the new $RWT_{dest}$ or $MWT_a$ can be lower than the previous one
because register and memory renaming is assumed. Dynamic register renaming is
very common in current architectures. Memory renaming is much more complex and
it is implemented to some extent by some mechanisms like the ARB of the
Multiscalar [3]. In this paper, we assume unlimited renaming capabilities for
both registers and memory.

When a new value for RWT or MWT is computed, the previous value is overwritten
because any further instruction in the trace will always refer to the last value
stored into a register or a memory location. However, in order to compute the
IPC, we have to determine the maximum CT for any instruction of the program. To
obtain this value, the analysis program keeps a variable that stores the maximum
CT up to the current execution point (Max_CT).
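
The bookkeeping of this section can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the analysis program used in the study: the `Instr` record, the function name `ipc_infinite_window` and the single `latency` parameter are hypothetical, and a uniform latency is assumed for every instruction type.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Instr:
    """One trace record (hypothetical layout, for illustration only)."""
    kind: str                    # "arith", "load" or "store"
    srcs: Tuple[int, ...] = ()   # source logical registers
    dest: Optional[int] = None   # destination register (arith and load)
    addr: Optional[int] = None   # effective address (load and store)


def ipc_infinite_window(trace, latency=1):
    """IPC with an infinite window, following expressions (1), (2) and (3)."""
    rwt = {}     # register write table: register -> CT of its last writer
    mwt = {}     # memory write table: address -> CT of the last store
    max_ct = 0   # Max_CT: largest completion time seen so far
    count = 0
    for ins in trace:
        # Cycle in which all register source operands are available.
        ready = max((rwt.get(r, 0) for r in ins.srcs), default=0)
        if ins.kind == "load":
            # A load also waits for the last store to the same address.
            ready = max(ready, mwt.get(ins.addr, 0))
        ct = ready + latency
        if ins.kind == "store":
            mwt[ins.addr] = ct   # expression (3)
        else:
            rwt[ins.dest] = ct   # expressions (1) and (2)
        max_ct = max(max_ct, ct)
        count += 1
    return count / max_ct if max_ct else 0.0


example_trace = [
    Instr("arith", (), 1),             # r1 = constant
    Instr("arith", (1,), 2),           # r2 = f(r1)
    Instr("arith", (), 3),             # r3 = constant (independent)
    Instr("store", (2,), addr=100),    # mem[100] = r2
    Instr("load", (), 4, 100),         # r4 = mem[100]
]
print(ipc_infinite_window(example_trace))   # 5 instructions in 4 cycles: 1.25
```

Overwriting a table entry with a smaller CT is allowed on purpose, which mirrors the unlimited renaming assumed in the text.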

3.4  IPC  computation  for  a  limited  instruction  window 

A limited instruction window with W entries and in-order retirement implies that
an instruction cannot start execution until the instruction W locations above in
the trace and all previous instructions have completed and retired. Thus, the
restriction of having a limited instruction window can be modeled by keeping
track of the CT of the last W instructions. This is accomplished by means of a
table, which is called window retirement time (WRT), that has W entries and
stores the retirement time of the last W instructions processed so far.

Thus, when computing the CT of an instruction, in addition to considering the CT
of its source operands, the WRT of the instruction W locations above also has to
be considered. For instance, for each arithmetic instruction processed by the
analysis program, the corresponding entry in the RWT is updated as follows:

$$RWT_{dest} = \max(RWT_{src1}, RWT_{src2}, WRT_{n\_inst\,\%\,W}) + Latency_{operation} \tag{4}$$

where n_inst refers to the ordinal number of the current instruction in the
trace. Expressions (2) and (3) are modified in a similar way to account for the
effect of the limited instruction window.

For each new instruction, the WRT is updated to reflect the retirement (commit)
time of the current instruction. This time is the maximum CT of any previous
instruction, including the current one, and it is stored in the same entry of
the WRT that was occupied by the instruction W locations above, since that entry
is no longer needed:

$$WRT_{n\_inst\,\%\,W} = Max\_CT \tag{5}$$
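
The circular use of the WRT can be illustrated with a short Python sketch. It is a simplified illustration rather than the original analysis program: the function name `ipc_limited_window` is hypothetical, only register-to-register instructions are modeled, and every instruction is assumed to have the same latency.

```python
def ipc_limited_window(trace, window, latency=1):
    """IPC with a W-entry window, following expressions (4) and (5).

    trace is a sequence of (dest, srcs) pairs of logical registers.
    """
    rwt = {}              # register write table
    wrt = [0] * window    # window retirement time, used as a circular buffer
    max_ct = 0
    for n_inst, (dest, srcs) in enumerate(trace):
        slot = n_inst % window
        # Wait for the source operands and for the retirement of the
        # instruction W positions earlier in the trace (expression (4)).
        ready = max([rwt.get(r, 0) for r in srcs] + [wrt[slot]])
        ct = ready + latency
        rwt[dest] = ct
        max_ct = max(max_ct, ct)
        # In-order retirement: the current instruction retires at Max_CT
        # and reuses the slot of the instruction W positions earlier.
        wrt[slot] = max_ct    # expression (5)
    return len(trace) / max_ct if max_ct else 0.0


# Eight mutually independent instructions: the window alone limits the IPC.
independent = [(reg, ()) for reg in range(8)]
for w in (1, 2, 8):
    print(w, ipc_limited_window(independent, w))   # 1.0, 2.0 and 8.0
```

With fully independent instructions and unit latency the sketch yields an IPC equal to the window size, which is the expected upper bound for this model.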

Figure 2. A stride-based predictor.

3.5 IPC computation for data value speculation

Data value speculation is based on predicting the source and/or the destination
operands of some instructions. In this section, we present a methodology to
compute the IPC when data value speculation is incorporated into a superscalar
processor, independently of the particular predictor being used. To this end, we
consider a predictor as a system that, given an instruction (usually its program
counter), provides its source and/or destination operands. In addition, each
individual prediction is characterized by the time when the prediction is
available (PT) and the correctness of the prediction.

In this paper, we consider data value speculation for the following types of
instructions: Loads, Stores, Integer Arithmetic, Integer Multiplication, Float
Arithmetic and Float Multiplication.

In all the cases, if a prediction is not correct, the RWT and MWT are updated as
if prediction were not used. If the prediction is correct, the RWT and MWT are
updated with the minimum of the completion time, given by expressions (1), (2)
and (3), and the prediction time, which is a characteristic of the particular
predictor being used. Section 4 discusses the predictors considered in this work
and, in particular, the time when predictions are available.
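
The update rule above can be sketched as follows. The function `ipc_with_value_speculation` and its argument layout are hypothetical, and the sketch makes one assumption that the text does not state explicitly: a predicted instruction still executes, so Max_CT tracks its actual completion time, while consumers see the earlier of CT and PT.

```python
def ipc_with_value_speculation(trace, predictions, latency=1):
    """IPC for register-only instructions with per-instruction predictions.

    trace:       sequence of (dest, srcs) pairs of logical registers.
    predictions: one entry per instruction, either None (no prediction)
                 or a (pt, correct) pair with the prediction time and
                 whether the predicted value matched the actual one.
    """
    rwt = {}
    max_ct = 0
    for (dest, srcs), pred in zip(trace, predictions):
        ready = max((rwt.get(r, 0) for r in srcs), default=0)
        ct = ready + latency
        # Assumption: the instruction itself still completes at ct.
        max_ct = max(max_ct, ct)
        visible = ct
        if pred is not None:
            pt, correct = pred
            if correct:
                # Correct prediction: consumers may use the value as soon
                # as either the prediction or the computation is ready.
                visible = min(ct, pt)
        rwt[dest] = visible
    return len(trace) / max_ct if max_ct else 0.0


# A chain of four dependent increments of register 1.
chain = [(1, (1,))] * 4
print(ipc_with_value_speculation(chain, [None] * 4))         # 1.0
print(ipc_with_value_speculation(chain, [(0, True)] * 4))    # 4.0
print(ipc_with_value_speculation(chain, [(0, False)] * 4))   # 1.0
```

The second call corresponds to a perfect predictor whose values are available at cycle 0, so the dependence chain no longer serializes execution.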


4  Data  predictors 


In this work we consider stride-based predictors, although the presented
methodology could be applied to any other data predictor. A stride predictor has
the structure shown in Figure 2. It is implemented by means of a table of 4096
entries that is direct-mapped and non-tagged, and it is indexed with the least
significant bits of the address (PC) of the instruction whose source or
destination operands are to be predicted. Each entry stores the following
information:

•  Last value: This is the last value seen by that instruction. This value
corresponds to the destination operand for all predictors except for the load
and store address predictors. In these cases, it corresponds to the last
effective address.

•  Stride: This field contains the stride observed for the values of the
corresponding instruction.

•  Confidence: This field is used to assign confidence to the prediction. It is
implemented by means of a 2-bit up/down saturating counter. A prediction is
considered correct only if the most significant bit is set.

Predictors for arithmetic instructions store the last result in the last value
field. Load address predictors store the last effective address. Load value
predictors store the last value read from memory. Finally, store predictors use
two tables: one for predicting the effective address and the other for
predicting the value to be written.

When an instruction is to be predicted (either its result or its effective
address, depending on the particular predictor), the prediction table is
accessed and the predicted value is computed by adding the stride to the last
value. If the most significant bit of the confidence field is set (i.e., the
prediction is considered to be correct) and the prediction is in fact correct,
the predicted value can be used instead of the actual value if the former is
available earlier. The stride field is only updated if the confidence counter is
below $10_2$ (binary 10) after being updated.
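
A minimal Python sketch of such a table is given below. Several details are assumptions made only for illustration and are not stated in the text: the class name `StridePredictor`, dropping two alignment bits of the PC before indexing, incrementing the counter on a correct prediction and decrementing it otherwise, and refreshing the last value on every update.

```python
class StridePredictor:
    """Direct-mapped, untagged stride predictor with 2-bit confidence."""

    def __init__(self, entries=4096):
        self.entries = entries
        self.last = [0] * entries     # last value (or effective address)
        self.stride = [0] * entries   # observed stride
        self.conf = [0] * entries     # 2-bit saturating counters (0..3)

    def _index(self, pc):
        # Assumption: instructions are 4-byte aligned, so the two lowest
        # PC bits carry no information and are dropped before indexing.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        """Return (predicted value, confident?) for the instruction at pc."""
        i = self._index(pc)
        confident = self.conf[i] >= 2   # most significant bit is set
        return self.last[i] + self.stride[i], confident

    def update(self, pc, actual):
        """Train the entry with the value actually produced."""
        i = self._index(pc)
        if self.last[i] + self.stride[i] == actual:
            self.conf[i] = min(3, self.conf[i] + 1)
        else:
            self.conf[i] = max(0, self.conf[i] - 1)
        # The stride is only replaced while confidence is below binary 10.
        if self.conf[i] < 2:
            self.stride[i] = actual - self.last[i]
        self.last[i] = actual


predictor = StridePredictor()
for value in (10, 14, 18, 22):      # a stride-4 sequence from one static PC
    predictor.update(0x1000, value)
print(predictor.predict(0x1000))    # (26, True)
```

In this sketch two consecutive correct predictions are needed before the most significant bit of the counter is set, so the first executions of a strided instruction are not speculated.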

In addition, we consider a perfect predictor that is assumed to always produce
correct predictions. This is used to determine the upper bound of the
performance that data value speculation can achieve.

4.1  Prediction  time 

An  important  feature  of  a  predictor  is  the  time  when  the  predicted  value  is  available. 
This  time  is  used  to  update  the  RWT  and  MWT  as  explained  in  section  3.5. 

Regarding the prediction time, two different types of predictors have been
considered:

•  Serialized:  Every  time  the  prediction  table  is  accessed,  only  one  prediction  per 
static  instruction  can  be  performed  at  most.  That  is,  an  instruction  is  not 
predicted  until  the  last  execution  of  the  same  static  instruction  has  been 
predicted. 

•  Non-serialized:  Every  time  the  prediction  table  is  accessed,  multiple  predictions 
for  each  static  instruction  can  be  performed.  In  particular,  all  the  subsequent 
executions  of  the  same  static  instruction  are  predicted  until  the  first  one  that  is 
incorrect.  That  is,  once  the  corresponding  entry  of  the  table  has  the  correct 
stride,  successive  executions  of  the  same  static  instructions  can  be  predicted  all 
at  once. 

The serialized predictors may be suitable for superscalar processors. In fact,
most of the studies on value prediction assume this type of predictor
[8][9][11][12][14]. A non-serialized predictor could be useful for architectures
supporting multiple threads of control obtained from a single program, such as
multiscalar processors [18] and speculative multithreaded processors [13].

To  determine  the  time  when  a  prediction  is  available  we  consider  a  parameter  that 
reflects  the  time  required  to  perform  a  prediction  operation  (either  of  a  single  value  for 
the  serialized  approach  or  multiple  values  for  the  non-serialized  one).  This  parameter  is 
called  the  prediction  latency  (PL).  This  is  the  time  required  for  a  table  look-up  plus  its 
update. 

The prediction time of each instruction is determined by means of an additional
field that is added to each entry of the prediction table for evaluation
purposes. This field stores the cycle in which the entry has been used/updated
for the last time. This field will be called last update time (LUT).

The prediction time for an instruction is just the sum of the last update time
and the prediction latency. That is:

$$PT = LUT + PL \tag{6}$$

The LUT is updated in a different way for serialized and non-serialized
predictors. For the former, for each new instruction of the trace, the
corresponding LUT is updated with the time when its operand is available (either
computed or predicted, whichever occurs first):

$$LUT = \begin{cases} RWT_{dest} & \text{for load and arithmetic instructions with destination register } dest \\ MWT_a & \text{for stores to address } a \end{cases} \tag{7}$$

For non-serialized predictors, the LUT field is updated in the same way as in
the serialized case, but only for those instructions that are mispredicted or
are considered not predictable as stated by the confidence field.
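
The effect of the two LUT policies can be seen on a small example. The sketch below is illustrative only: the function `chain_ipc` is hypothetical, it models repeated executions of a single static instruction of the form r = r + c, it assumes a warmed-up predictor whose predictions are all correct and confident, and it assumes that Max_CT tracks the actual completion time of each instruction.

```python
def chain_ipc(n, serialized, latency=1, pl=1):
    """IPC of n dependent executions of one static instruction r = r + c."""
    lut = 0      # last update time of the predictor entry
    rwt = 0      # time at which the current value of r is available
    max_ct = 0
    for _ in range(n):
        pt = lut + pl          # prediction time: PT = LUT + PL
        ct = rwt + latency     # completion time without prediction
        max_ct = max(max_ct, ct)
        rwt = min(ct, pt)      # correct prediction: earliest of CT and PT
        if serialized:
            # Serialized: the entry is updated on every execution with the
            # time at which the operand became available.
            lut = rwt
        # Non-serialized: correct and confident predictions leave the LUT
        # untouched, so later executions can be predicted all at once.
    return n / max_ct


print(chain_ipc(8, serialized=True))    # 1.0
print(chain_ipc(8, serialized=False))   # 4.0
```

In the serialized case each prediction becomes available exactly when the value would have been computed anyway, so a correct prediction gives no benefit for this chain. In the non-serialized case all predictions are available at cycle 1 and every execution after the first completes at cycle 2.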

5  Results 

The results of this section assume a one-cycle latency for all instructions and
a one-cycle prediction latency.

Table 1 shows the IPC achieved by the ideal processor described in section 3.2
with an infinite instruction window and without data value speculation.

These results will be used as a baseline to compare the performance of data
value speculation techniques. They represent the maximum parallelism that it is
possible to achieve in an ideal processor that is only constrained by data
dependences, whereas data value speculation removes this constraint. Notice that
even for this ideal machine, the average IPC is 37.39 for integer applications
and 790.29 for floating point applications. When we add the constraint of a
limited instruction window of 128 instructions, the IPC goes down to 9.64 and
17.51 respectively. This may suggest that relieving the restrictions imposed by
data dependences through data value speculation can be an interesting mechanism
to boost performance. In the following results, only the average result for
integer and FP programs will be shown.
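
As a quick check, the averages quoted for the infinite machine are the arithmetic means of the per-benchmark IPC values listed in Table 1. The short sketch below recomputes them; the dictionary and function names are illustrative only.

```python
# Per-benchmark IPC values transcribed from Table 1.
spec_int = {
    "go": 89.45, "m88ksim": 17.14, "gcc": 47.02, "compress": 35.71,
    "li": 27.62, "ijpeg": 34.12, "perl": 18.72, "vortex": 29.34,
}
spec_fp = {
    "tomcatv": 397.79, "swim": 1403.82, "su2cor": 56.64, "hydro2d": 181.09,
    "applu": 578.31, "mgrid": 4735.11, "turb3d": 140.19, "apsi": 231.21,
    "fpppp": 105.71, "wave5": 73.02,
}


def mean_ipc(table):
    """Arithmetic mean of the IPC values of one benchmark group."""
    return sum(table.values()) / len(table)


print(f"SpecInt average IPC: {mean_ipc(spec_int):.2f}")   # 37.39
print(f"SpecFP average IPC: {mean_ipc(spec_fp):.2f}")     # 790.29
```

The FP average is dominated by a few programs such as mgrid and swim, which is worth keeping in mind when reading the averaged speedups reported below.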

Figure 3 shows the speedup (in logarithmic scale) achieved by data value
speculation with perfect prediction in relation to the infinite machine without
data value speculation. In this figure and the following ones the speedup is
computed as follows:

$$Speedup = \frac{\text{IPC with data value speculation}}{\text{IPC without data value speculation}}$$

In each bar, only a single type of instruction is predicted individually. With
perfect prediction, when an instruction is predicted its result is considered to
be available at cycle 0. Looking at the graphs, one can see that the speedup
obtained by predicting memory instructions, both loads and stores, is lower than
the speedup achieved by predicting arithmetic instructions. This suggests that
for the analyzed programs there are many more arithmetic than memory
instructions on critical paths. The speedup achieved by predicting
multiplications is almost negligible. In addition to multiplications not being
on critical paths, this may be due to the small percentage of multiplication
operations, as shown in Figure 1.

VECPAR’98 - 3rd International Meeting on Vector and Parallel Processing

Table 1. IPC achieved with infinite resources and no data value speculation

SpecInt     IPC       SpecFP     IPC
go          89.45     tomcatv    397.79
m88ksim     17.14     swim       1403.82
gcc         47.02     su2cor     56.64
compress    35.71     hydro2d    181.09
li          27.62     applu      578.31
ijpeg       34.12     mgrid      4735.11
perl        18.72     turb3d     140.19
vortex      29.34     apsi       231.21
                      fpppp      105.71
                      wave5      73.02
Average     37.39     Average    790.29

Figure 4 shows the speedup obtained for a realistic prediction scheme based on a
stride predictor, as described in previous sections. The instruction window is
considered to be infinite and the prediction is non-serialized. The speedup
achieved by predicting arithmetic instructions is very large, which suggests
that arithmetic prediction may be the most effective approach to remove the
serialization imposed by data dependences. The IPC of data value speculation
just for arithmetic instructions is 531 times higher than the IPC achieved
without data value speculation, for an infinite machine and the FP benchmarks.
When data value speculation is implemented for all the instructions, the speedup
goes up to 2368. The speedup for integer programs is not so high (42 when
predicting all the instructions). On the other hand, the speedup achieved by
predicting memory instructions is much more limited (1.4 and 4.8 for integer and
FP benchmarks respectively when predicting stores and load values). Predicting
multiplications is not considered any further due to the poor results observed
for the perfect predictor.

The speedup obtained with a serialized predictor is depicted in Figure 5. Notice
that, as pointed out before, this scheme would correspond to the implementation
of data value speculation on a superscalar processor, since in such processors
there is only one flow of control and a given execution of a static instruction
can be predicted only if its previous execution has updated the prediction
table. On the other hand, a non-serialized predictor can be exploited by an
architecture supporting multiple threads of control.

Figure 3. Speedup achieved by data speculation with perfect prediction, for
different types of predictors. [Legend: load predictor, store predictor,
multiplication predictor, arithmetic predictor.]

The  speedup  achieved  by  serialized  prediction  is  still  quite  significant.  The  IPC 
achieved  by  these  schemes  is  30  and  35  times  higher  than  the  IPC  achieved  without  data 
value  speculation  for  integer  and  FP  programs  respectively.  These  results  also  show  that 
the  potential  gain  that  load  prediction  may  achieve  is  slightly  higher  for  value  prediction 
than  for  address  prediction,  but  this  gain  is  insignificant  when  compared  to  arithmetic 
prediction. 

If we compare the speedup achieved by non-serialized prediction (Figure 4) against
the speedup achieved by serialized prediction (Figure 5), we can observe that for integer
benchmarks there is not much difference (e.g. it goes from 42 to 30 when predicting all
the instructions), whereas for FP benchmarks the difference is huge (e.g. it goes from
2368 to 35 when predicting all the instructions). The main reason for this different
behavior in the two types of benchmarks can be explained through the figures in Table 2.
This table shows the percentage of correctly predicted arithmetic instructions for
which the completion time (CT) is lower than the prediction time (PT). For these
instructions, the prediction does not provide any improvement in spite of being correct.
As expected, this percentage is greater when the predictions are serialized than when
they are not, since the prediction time of the serialized scheme is in general higher.
Besides, the difference between serialized and non-serialized schemes for FP benchmarks
is much higher than for integer benchmarks, which explains the higher impact of
serialized prediction for FP benchmarks, as observed in Figure 4 and Figure 5.

The speedup achieved by predicting instructions relies on the amount of strided values
existing among the applications. Figure 6 shows the percentage of strided values for
the different instruction types for the whole Spec95 benchmark suite. It can be seen that
load addresses have the greatest percentage of strided references and therefore one may
expect a speedup for load address speculation higher than it actually is (see Figure 4 and
Figure 5). However, even when the address of a load is predicted, it has to wait for
previous stores to the same address to finish. On the other hand, predicting the value of a
load or the result of any other instruction completely avoids the order imposed by data
dependences. Simple arithmetic instructions (mainly integer arithmetic) have a high
percentage of strided values. This fact, along with the significant weight of arithmetic
instructions on the critical path (as confirmed in the evaluation of the perfect prediction
scheme), makes arithmetic prediction the most effective type of speculation among
the ones evaluated in this work.

VECPAR’98 - 3rd International Meeting on Vector and Parallel Processing

Figure 4. Speedup achieved by data value speculation with non-serialized prediction
(legend: load address prediction, load value prediction, store prediction).

Figure 5. Speedup achieved by data value speculation with serialized predictions.

Table 2. Percentage of correctly predicted instructions whose CT is lower than their PT.

             Non-serialized    Serialized
  SpecInt        59.65            70.85
  SpecFp         48.31            90.64
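To make the notion of a strided value concrete, the following is a minimal sketch of a stride-based value predictor. It is not the predictor evaluated in this work: the table layout, the class and function names, and the rule of updating the stride on every execution are illustrative assumptions. The loop in `strided_fraction` updates the table before the next prediction of the same static instruction is made, which mirrors the serialized scheme described above.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    last: int    # last value produced by this static instruction
    stride: int  # difference between its last two values


class StridePredictor:
    """Toy stride-based value predictor indexed by instruction address (pc).

    Hypothetical structure for illustration only.
    """

    def __init__(self):
        self.table = {}

    def predict(self, pc):
        """Return the predicted next value, or None if the pc is unseen."""
        entry = self.table.get(pc)
        if entry is None:
            return None
        return entry.last + entry.stride

    def update(self, pc, value):
        """Record the actual value once the instruction has completed."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = Entry(last=value, stride=0)
        else:
            entry.stride = value - entry.last
            entry.last = value


def strided_fraction(pc, values):
    """Fraction of values predicted correctly when every prediction
    waits for the update of the previous execution (serialized use)."""
    predictor = StridePredictor()
    hits = 0
    for value in values:
        if predictor.predict(pc) == value:
            hits += 1
        predictor.update(pc, value)
    return hits / len(values)
```

For a sequence such as 0, 4, 8, 12, 16 the first two executions only train the table, and every later value is predicted correctly.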

Finally, we consider the impact of data value speculation with a limited instruction
window. Figure 7 shows the speedup of data value speculation (IPC achieved by data
value speculation divided by IPC achieved without data value speculation) when all
types of instructions are predicted using separate history tables for each class, and
predicting the value of loads. A non-serialized predictor is considered since it outperforms
a serialized predictor for an infinite window (notice that the speedup is not depicted in
logarithmic scale but in linear scale). It can be seen in this figure that the impact of the
size of the instruction window is very significant since, for instance, the speedup
decreases from 2368 to only 1.75 for a window of 512 instructions in the SpecFp
programs. Furthermore, the gain due to data value speculation for SpecInt outperforms
the gain for SpecFp, which is the opposite of what happened with an infinite instruction
window.

Figure 6. Percentage of strided values for each type of instruction (SpecInt and SpecFp).

Figure 7. Speedup achieved with a finite instruction window (SpecInt and SpecFp).

A main conclusion of the study of the effect of data value speculation on a limited
instruction window is that it is an effective technique that could be considered for future
generation microprocessors. A speedup of around 2 can be achieved with simple
stride-based predictors. However, the potential benefits of data value speculation are much
higher for very large instruction windows. In this scenario, conventional superscalar
microprocessors have been shown to be rather limited in the amount of ILP that they can
exploit, mainly due to data dependences. This limitation can be significantly relieved by
data value speculation techniques. Thus, novel organizations that support large
instruction windows, like the multiscalar architecture [18] and the speculative
multithreaded processor [13], can benefit from data value speculation to a larger extent
than superscalar processors.


6  Conclusions 


[...] a study of the limits of instruction level parallelism (ILP) that can be exploited
by a machine with infinite resources, infinite instruction window, perfect branch
prediction and ideal memory. We have shown that avoiding the ordering imposed by data
dependences is a promising approach to improve the performance of superscalar
processors for future generations. This can be accomplished by data value speculation
techniques. These techniques are based on predicting the source or destination operands
of instructions and speculatively executing the instructions that depend on them.

Data value speculation has been approached by means of both perfect and stride-based
predictors. Two different types of prediction schemes have been studied: serialized
and non-serialized. The former is oriented to superscalar processors whereas the
latter is more suitable for multithreaded architectures (i.e., machines that support
multiple threads of control from a single program). We have measured the benefits of data
value speculation techniques by comparing the limits of ILP that can be exploited with
such techniques with those of a superscalar processor with the same features but without
data value speculation. Results show an important speedup for arithmetic instructions
both for serialized and non-serialized prediction schemes. We have also observed that
the difference between these two schemes is very high for FP programs (non-serialized
schemes always outperform serialized ones) but relatively low for integer programs.

Finally, we have evaluated the impact of data value speculation with a limited
instruction window. We have observed that the speedup suffers an important reduction
but is still significant. However, the benefits of data value speculation increase with
the size of the instruction window. We believe that data value speculation may play an
important role when it is combined with mechanisms to support large instruction windows.


Abstract. Matter in the universe mainly consists of plasma. The dynamics
of plasmas is controlled by magnetic fields. To simulate the evolution
of magnetized plasma, we solve the equations of magnetohydrodynamics
using the Versatile Advection Code (VAC).

To demonstrate the versatility of VAC, we present calculations of the
Rayleigh-Taylor instability, causing a heavy compressible gas to mix into
a lighter one underneath, in an external gravitational field. Using a single
source code, we can study and compare the development of this instability
in two and three spatial dimensions, without and with magnetic
fields. The results are visualised and analysed using IDL (Interactive
Data Language) and AVS (Advanced Visual Systems).

The  example  calculations  are  performed  on  a  Cray  J90.  VAC  also  runs 
on  distributed  memory  architectures,  after  automatic  translation  to  High 
Performance  Fortran.  We  present  performance  and  scaling  results  on  a 
variety  of  architectures,  including  Cray  T3D,  Cray  T3E,  and  IBM  SP 
platforms. 


1  MagnetoHydroDynamics 

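The MHD equations describe the behaviour of a perfectly conducting fluid in the
presence of a magnetic field. The eight primitive variables are the density
$\rho(\mathbf{r}, t)$, the three components of the velocity field $\mathbf{v}(\mathbf{r}, t)$,
the thermal pressure $p(\mathbf{r}, t)$, and the three components of the magnetic field
$\mathbf{B}(\mathbf{r}, t)$. When written in conservation form, the conservative variables
are density $\rho$, momentum $\rho\mathbf{v}$, energy density $E$, and the magnetic field
$\mathbf{B}$. The thermal pressure $p$ is related to the energy density as
$p = (\gamma - 1)(E - \frac{1}{2}\rho v^2 - \frac{1}{2}B^2)$, with $\gamma$ the ratio of
specific heats. The eight non-linear partial differential equations express: (1) mass
conservation; (2) the momentum evolution (including the Lorentz force); (3) energy
conservation; and (4) the evolution of the magnetic field in an induction equation.
The equations are given by

$$\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho \mathbf{v}) = 0, \tag{1}$$

$$\frac{\partial (\rho \mathbf{v})}{\partial t} + \nabla \cdot [\rho \mathbf{v}\mathbf{v} + p_{tot} \mathbf{I} - \mathbf{B}\mathbf{B}] = \rho \mathbf{g}, \tag{2}$$

$$\frac{\partial E}{\partial t} + \nabla \cdot (E \mathbf{v}) + \nabla \cdot (p_{tot} \mathbf{v}) - \nabla \cdot (\mathbf{v} \cdot \mathbf{B}\mathbf{B}) = \rho \mathbf{g} \cdot \mathbf{v} + \nabla \cdot [\mathbf{B} \times \eta (\nabla \times \mathbf{B})], \tag{3}$$

$$\frac{\partial \mathbf{B}}{\partial t} + \nabla \cdot (\mathbf{v}\mathbf{B} - \mathbf{B}\mathbf{v}) = -\nabla \times [\eta (\nabla \times \mathbf{B})]. \tag{4}$$

We introduced $p_{tot} = p + \frac{1}{2}B^2$ as the total pressure, $\mathbf{I}$ as the
identity tensor, $\mathbf{g}$ as the external gravitational field, and defined magnetic
units such that the magnetic permeability is unity.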
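The relation between the thermal pressure and the conservative variables can be sketched in a few lines. This is an illustration of the formula $p = (\gamma - 1)(E - \frac{1}{2}\rho v^2 - \frac{1}{2}B^2)$ only, not code taken from VAC; the function names and the default $\gamma = 5/3$ are assumptions made for the example.

```python
def thermal_pressure(rho, momentum, energy, b_field, gamma=5.0 / 3.0):
    """Thermal pressure from the conservative variables.

    rho      : density
    momentum : components of rho*v
    energy   : total energy density E
    b_field  : components of B (units with unit magnetic permeability)
    gamma    : ratio of specific heats (5/3 is an assumed default)
    """
    v2 = sum(m * m for m in momentum) / (rho * rho)
    b2 = sum(b * b for b in b_field)
    return (gamma - 1.0) * (energy - 0.5 * rho * v2 - 0.5 * b2)


def energy_density(rho, velocity, pressure, b_field, gamma=5.0 / 3.0):
    """Inverse relation: total energy density from primitive variables."""
    v2 = sum(v * v for v in velocity)
    b2 = sum(b * b for b in b_field)
    return pressure / (gamma - 1.0) + 0.5 * rho * v2 + 0.5 * b2


def total_pressure(pressure, b_field):
    """Total pressure: thermal plus magnetic pressure B^2/2."""
    return pressure + 0.5 * sum(b * b for b in b_field)
```

Converting primitive variables to the energy density and back should return the original pressure, which is a convenient consistency check on any implementation of these relations.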

Ideal MHD corresponds to a zero resistivity $\eta$ and ensures that magnetic flux
is conserved. In resistive MHD, field lines can reconnect. An extra constraint
arises from the non-existence of magnetic monopoles, expressed by $\nabla \cdot \mathbf{B} = 0$.
The ideal MHD equations allow for Alfvén and magnetoacoustic wave modes,
while the induction equation prescribes that flow across the magnetic field
entrains the field lines, so that field lines are ‘frozen-in’. The field may, in turn,
confine the plasma. The MHD description can be used to study both laboratory
and astrophysical plasma phenomena. We refer the interested reader to [2] for
a derivation of the MHD equations starting from a kinetic description of the
plasma, while excellent treatments of MHD theory can be found in, e.g., [4, 1].

2  The  Versatile  Advection  Code 

The Versatile Advection Code (VAC) is a general purpose software package for
solving a conservative system of hyperbolic partial differential equations with
additional non-hyperbolic source terms [10, 11], in particular the hydrodynamic
($\mathbf{B} = 0$) and magnetohydrodynamic equations (1)-(4), with optional terms for
gravity, viscosity, thermal conduction, and resistivity.

VAC  is  implemented  in  a  modular  way,  which  ensures  its  capacity  to  model 
several  systems  of  conservation  laws,  and  makes  it  possible  to  share  solution 
algorithms  among  all  systems.  A  variety  of  spatial  and  temporal  discretizations 
are  implemented  for  solving  such  systems  on  a  finite  volume  structured  grid. 
The  spatial  discretizations  include  two  Flux  Corrected  Transport  variants  and 
four  Total  Variation  Diminishing  (TVD)  algorithms  (see  [15]).  These  numerical 
schemes  are  shock-capturing  and  second  order  accurate  in  space  and  time. 

Explicit time integration may exploit predictor-corrector and Runge-Kutta
time stepping, while for multi-timescale problems, mixed implicit/explicit time
integration is available to treat only some variables, or some terms in the
governing equations, implicitly [7]. Fully implicit time integration can be of interest
when modeling steady-state problems. Typical astrophysical applications where
semi-implicit and implicit methods are efficiently used can be found in [8, 14].

VAC runs on personal computers (Pentium PC under Linux), on a variety of
workstations (DEC, Sun, HP, IBM) and has been used on SGI Power Challenge,
Cray J90 and Cray C90 platforms. To run VAC on distributed memory
architectures, an automatic translation to High Performance Fortran (HPF) is done
at the preprocessing phase (see [9]). We have tested the generated HPF code
on several platforms, including a cluster of Sun workstations, a Cray T3D, a
16-node Connection Machine 5 (using an automatic translation to CM-Fortran),
an IBM SP and a Cray T3E. Scaling and performance are discussed in section 3.

On-line manual pages, general visualization macros (for IDL, MatLab and
SM), and file format transformation programs (for AVS, DX, and Gnuplot)
facilitate the use of the code and aid in the subsequent data analysis.

In this manuscript, we present calculations done in two and three spatial
dimensions, for both hydrodynamic and magnetohydrodynamic problems. This
serves to show how VAC allows a single problem setup to be studied under
various physical conditions. We have used IDL and AVS to analyse the
application presented here. Our data analysis and visualization encompass X-term
animation, generating MPEG-movies, and video production.


3  Scaling  results 


As detailed in [9], the source code uses a limited subset of the Fortran 90
language, extended with the HPF forall statement and the Loop Annotation SYntax
(LASY) which provides a dimension independent notation. The LASY
notation [12] is translated by the VAC preprocessor according to the dimensionality
of the problem. Further translation to HPF involves distributing all global
non-static arrays across the processors, which is accomplished in the preprocessing
stage by another Perl script.
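What distributing an array across processors means can be illustrated with a small sketch. The text does not state which distribution the generated HPF code uses, so the contiguous block layout below (block size equal to the ceiling of n divided by the number of processors) is an assumption, and `block_ranges` is a hypothetical helper, not part of VAC or of any HPF compiler.

```python
def block_ranges(n, nprocs):
    """Split indices 0..n-1 into contiguous blocks, one per processor.

    Returns a list of (start, stop) half-open ranges. Trailing processors
    may receive a shorter (or empty) block when n is not a multiple of
    nprocs.
    """
    block = -(-n // nprocs)  # ceiling division
    ranges = []
    for p in range(nprocs):
        start = min(p * block, n)
        stop = min(start + block, n)
        ranges.append((start, stop))
    return ranges


# A grid dimension of 104 points splits evenly over 1, 2, 4, 8 and 13
# processors, the processor counts used in the timing runs of this section.
for nprocs in (1, 2, 4, 8, 13):
    sizes = {stop - start for start, stop in block_ranges(104, nprocs)}
    print(nprocs, sizes)
```

With such a layout every processor owns the same number of grid columns for these processor counts, so load imbalance does not distort the scaling measurements.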

Figure 1 summarizes timing results obtained on two vector (Cray J90 and
C90) and three massively parallel platforms (Cray T3D, T3E and IBM SP).
We solve the shallow water equations (1)-(2) with $\mathbf{B} = 0$ and $p = (g/2)\rho^2$ on
a 104 x 104 grid on 1, 2, 4, 8, and 13 processors. This simple model problem
is described in [13], and our solution method contains the full complexity of a
real physics application. We used an explicit TVD scheme exploiting a Roe-type
approximate Riemann solver. We plot the number of physical grid cell updates
per second against the number of processors (solid lines). The dashed lines show
the improved scaling for a larger problem of size 208 x 208, up to 16 processors.
On all parallel platforms, we used the Portland Group pghpf compiler. We
find an almost linear speedup on the Cray T3D and T3E architectures, which
is rather encouraging for such small problem sizes. Note how the single node
execution on the IBM SP platform is a factor of 2 to 3 faster than the Cray
T3E, but the scaling results are poor. The figure indicates clearly that for this
hydrodynamic application, on the order of 10 processors of the Cray T3E and
IBM SP are needed to outperform a vectorized Fortran 90 run on one processor
of the Cray C90. Detailed optimization strategies for all architectures shown in
Figure 1 (note the Pentium PC result and the DEC Alpha workstation timing
in the bottom left corner) are discussed in [13].

Fig. 1. Combined performance and scaling results for running the Versatile Advection
Code on vector and parallel platforms. See text for details.


4  Simulating Rayleigh-Taylor instabilities

To demonstrate the advantages of having a versatile source code for simulating
fluid flow, we consider what happens when a heavy compressible plasma is
sitting on top of a lighter plasma in an external gravitational field. Such a situation
is unstable as soon as the interface between the two is perturbed from perfect
flatness. The instability is known as the Rayleigh-Taylor instability. Early
analytic investigations date back to a comprehensive and detailed analysis given by
Chandrasekhar [3].

The initial configuration is one where two layers of prescribed density ratio
(dense to light ratio of $\rho_d/\rho_l = 10$) are left to evolve between two planes
($y = 0$ and $y = 1$), with gravity pointing downwards ($\mathbf{g} = -\mathbf{e}_y$, with
$\mathbf{e}_y$ the unit vector). The heavy plasma on the top is separated from the light
plasma below it by the surface $y = y_0 + \epsilon \sin(k_x x) \sin(k_z z)$. Initially, both
are at rest with $\mathbf{v} = 0$, and the thermal pressure is set according to the
hydrostatic balance equation (centered differenced formula $dp/dy = -\rho$). Boundary
conditions make top and bottom perfectly conducting solid walls, while the horizontal
directions are periodic. We then exploit the options available in VAC to see how the
evolution changes when going from two to three spatial dimensions, and what happens
when magnetic fields are taken along. All calculations are done on a Cray J90,
where we preprocess the code to Fortran 90 for single-node execution.

4.1  Two-dimensional  simulations 

Figure 2 shows the evolution of the density in two two-dimensional simulations
without and with an initial horizontal magnetic field $\mathbf{B} = 0.1\,\mathbf{e}_x$. Both
simulations are done on a uniform 100 x 100 square grid, and the parameters for the
initial separating surface are $y_0 = 0.8$, $\epsilon = 0.05$, and $k_x = 2\pi$ (there is
no $z$ dependence in 2D). The data is readily analysed using IDL.
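The two-dimensional initial density can be sketched as follows. The sketch uses the interface parameters quoted above ($y_0 = 0.8$, $\epsilon = 0.05$, $k_x = 2\pi$, a 100 x 100 grid, density ratio 10). Several details are assumptions made for illustration and are not taken from VAC: the horizontal extent $0 \le x \le 1$, cell-centred coordinates, and the absolute densities 1.0 and 0.1.

```python
import math


def interface_height(x, y0=0.8, eps=0.05, kx=2.0 * math.pi):
    """Perturbed interface y = y0 + eps*sin(kx*x) (no z dependence in 2D)."""
    return y0 + eps * math.sin(kx * x)


def initial_density(nx=100, ny=100, rho_dense=1.0, rho_light=0.1):
    """Cell-centred density on the unit square, heavy fluid on top.

    rho_dense and rho_light are assumed values with the ratio 10 used
    in the text; rows are indexed by y, columns by x.
    """
    grid = []
    for j in range(ny):
        y = (j + 0.5) / ny
        row = []
        for i in range(nx):
            x = (i + 0.5) / nx
            row.append(rho_dense if y > interface_height(x) else rho_light)
        grid.append(row)
    return grid


density = initial_density()
heavy_cells = sum(value == 1.0 for row in density for value in row)
print("fraction of heavy cells:", heavy_cells / (100 * 100))
```

Because the sine perturbation averages to zero over one wavelength, roughly one fifth of the cells (those above $y_0 = 0.8$) hold the heavy fluid.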

In  both  cases,  the  heavy  plasma  is  redistributed  in  falling  spikes  or  pillars,  also 
termed  Rayleigh-Taylor  ‘fingers’,  pushing  the  lighter  plasma  aside  with  pressure 
building  up  underneath  the  pillars.  However,  in  the  ideal  MHD  case,  the  frozen-in 
field  lines  are  forced  to  move  with  the  sinking  material,  so  it  gets  wrapped  around 
the  pillars.  The  extra  magnetic  pressure  and  tension  forces  thereby  confine  the 
falling  dense  plasma  and  slow  down  the  sinking  and  mixing  process.  In  fact,  since 
we  took  the  initial  displacement  perpendicular  to  the  horizontal  magnetic  field, 
we  effectively  maximized  its  stabilizing  influence. 

In [3], the linear phase of the Rayleigh-Taylor instability in both hydrodynamic
and magnetohydrodynamic incompressible fluids is treated analytically.
The stabilizing effect of the uniform horizontal magnetic field is evident from
the expression of the growthrate $n$ as a function of the wavenumber $k$:

$$n^2 = g k \left[ \frac{\rho_d - \rho_l}{\rho_d + \rho_l} - \frac{B^2 k}{2\pi (\rho_d + \rho_l)\, g} \right]. \tag{5}$$

Hence, while the shortest wavelength perturbations are the most unstable ones
in hydrodynamics ($B = 0$), all wavelengths below a critical
$\lambda_{crit} = B^2 / [g(\rho_d - \rho_l)]$ are effectively suppressed by a horizontal
magnetic field of strength $B$. Similarly, our initial perturbation with
$\lambda = 2\pi/k_x = 1$ will be stabilized as soon as the magnetic field surpasses a
critical field strength $B_{crit} = \sqrt{g \lambda (\rho_d - \rho_l)} \approx 0.95$.

Fig. 2. Rayleigh-Taylor instability simulated in two spatial dimensions, in a
hydrodynamic (left) and magnetohydrodynamic (right) case. The logarithm of the density
and, in the magnetohydrodynamic case, also the magnetic field lines, are plotted.
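The quoted critical field strength of about 0.95 can be checked with a short calculation. The sketch assumes the growth rate has the form $n^2 = gk[(\rho_d - \rho_l)/(\rho_d + \rho_l) - B^2 k / (2\pi(\rho_d + \rho_l)g)]$, which gives $\lambda_{crit} = B^2/[g(\rho_d - \rho_l)]$. It also assumes $g = 1$, $\rho_d = 1$ and $\rho_l = 0.1$; only the density ratio of 10 and the unit gravity vector are stated in the text, so the absolute densities are an assumption chosen because they reproduce the quoted value.

```python
import math


def growth_rate_squared(k, b, g=1.0, rho_dense=1.0, rho_light=0.1):
    """n^2 for wavenumber k and horizontal field strength b.

    Assumed form of the linear growth rate; positive values mean the
    perturbation grows, negative values mean it is stabilized.
    """
    total = rho_dense + rho_light
    return g * k * ((rho_dense - rho_light) / total
                    - b * b * k / (2.0 * math.pi * total * g))


def critical_wavelength(b, g=1.0, rho_dense=1.0, rho_light=0.1):
    """Wavelength below which a field of strength b suppresses growth."""
    return b * b / (g * (rho_dense - rho_light))


def critical_field(wavelength, g=1.0, rho_dense=1.0, rho_light=0.1):
    """Field strength above which the given wavelength is stabilized."""
    return math.sqrt(g * wavelength * (rho_dense - rho_light))


k = 2.0 * math.pi  # perturbation with unit wavelength
print("B_crit for unit wavelength:", round(critical_field(1.0), 2))
print("B = 0.1 grows:", growth_rate_squared(k, 0.1) > 0.0)
print("B = 1.0 grows:", growth_rate_squared(k, 1.0) > 0.0)
```

Under these assumptions a field of strength 0.1 leaves the unit-wavelength perturbation unstable, while a field of unit strength lies above the critical value and stabilizes it.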

The simulations confirm and extend these analytic findings: the predicted
growthrate can be checked (noting that our simulations are compressible), while
the further non-linear evolution can be investigated. The discrete representation
of the initial separating surface causes intricate small-scale structure to develop
in the simulation at left of Figure 2. This is consistent with the fact that in a
pure hydrodynamic case, the shortest wavelengths are the most unstable ones.
Naturally, the simulation is influenced by numerical diffusion, while the periodic
boundary conditions and the initial state select preferred wavenumbers. The
suppression of short wavelength disturbances in the MHD case is immediately
apparent, since no small-scale structure develops. The simulation at right has
an initial plasma beta (ratio of gas to magnetic pressure forces) of about 400.
For still higher plasma beta, the MHD case will resemble the hydrodynamic
simulation more closely, while a stronger magnetic field ($\mathbf{B} = \mathbf{e}_x$)
suppresses the development of the instability entirely, as theory predicts.

Note also how the falling pillars develop a mushroom shape (left frames) as a
result of another type of instability caused by the velocity shear across their edge:
the Kelvin-Helmholtz instability. The lighter material is swept up in swirling
patterns around the sinking spikes. In the MHD simulation (right frames) the
Kelvin-Helmholtz instability does not develop due to the stabilizing effect of the
magnetic field. Typically, however, both instabilities play a crucial role in various
astrophysical situations. Two-dimensional MHD simulations of Rayleigh-Taylor
instabilities in young supernova remnants [5] demonstrate this, and confirm the
basic effects evident from Figure 2: magnetic fields get warped and amplified
around the ‘fingers’. General discussions of these and other hydrodynamic and
magnetohydrodynamic instabilities are found in [3].


4.2  Three-dimensional  simulations 

In Figure 3, we present a snapshot of a hydrodynamical calculation in a 3D
50 x 50 x 50 unit box, where the initial configuration has both $k_x = 2\pi$ and
$k_z = 2\pi$. With gravity downwards, we look into the box from below. On two vertical
cuts, we show at time $t = 2$ (i) the logarithm of the density in a color scale
and (ii) the streamlines of the velocity field, colored according to the (logarithm
of the) density. The cuts are chosen to intersect the initial separating surface
between the heavy and the light plasma at its extremal positions where the
motion is practically two-dimensional. 3D effects are readily identified by direct
comparison with the two-dimensional hydrodynamic calculation. The time series
of the 3D data set has been analysed using AVS (a video is made with AVS to
demonstrate how density, pressure and velocity fields evolve during the mixing
process).

Figure 4 shows the evolution of a three-dimensional MHD calculation at times
$t = 1$ and $t = 2$. We show an isosurface of the density (at 1% above the initial
value for $\rho_d$), colored according to the thermal pressure. A cutting plane also
shows the vertical stratification of the thermal pressure. Note the change in the
initial configuration ($k_x = 6\pi$ and $k_z = 4\pi$, with $y_0 = 0.7$): more and narrower
spikes are seen to grow and to split up. The AVS analysis of the full time series
shows how droplets form at the tips of the falling pillars, which seem to expand
horizontally to a critical size before continuing their fall. At the same time, the
magnetic field gets wrapped around the falling pillars. Figure 4 nicely confirms
that places where spikes branch into narrower ones correspond to places with
excess pressure underneath. Similar studies of incompressible 3D ideal MHD cases
are found in [6]. They confirm that strong tangential fields suppress the growth
as expected from theoretical considerations, while the Rayleigh-Taylor instability
acts to amplify magnetic fields locally. In such magnetic fluids, parameter
regimes exist where secondary Kelvin-Helmholtz instabilities develop, just as in
the hydrodynamic situation of Figure 3 (note the regions of strong vorticity in
the streamlines).

Fig. 3. Rayleigh-Taylor instability in 3D, purely hydrodynamic. We show streamlines
(left) and density contours (right) in two vertical cutting planes.

Fig. 4. 3D MHD Rayleigh-Taylor instability. At two consecutive times, an isosurface
of the density is colored according to the thermal pressure. The thermal pressure is
also shown in a vertical cut.

5  Conclusions 

We have developed a powerful tool to simulate magnetized fluid dynamics. The
Versatile Advection Code runs on many platforms, from PCs to supercomputers,
including distributed memory architectures. The rapidly maturing HPF
compilers can yield scalable parallel performance for general fluid dynamical
simulations. Clearly, the scaling and performance of VAC make high resolution 3D
simulations possible, and detailed investigations may broaden our insight into the
intricate dynamics of magneto-fluids and plasmas.

We presented simulations of the Rayleigh-Taylor instability in two and three
spatial dimensions, with and without magnetic fields. VAC allows one to do
all these simulations with a single problem setup, since the equations to solve
and the dimensionality of the problem are simply specified in a preprocessing
phase. Data analysis can be done using a variety of data visualization packages,
including IDL and AVS as demonstrated here. In the future, we plan to use VAC
to investigate challenging astrophysical problems, like winds and jets emanating
from stellar objects, magnetic loop dynamics, accretion onto black holes, etc.

Website info on the code is available at http://www.fys.ruu.nl/~toth/ and
at http://www.fys.ruu.nl/~mpr/. MPEG-animations of various test problems
can also be found there.


Abstract. We introduce the parallel grid manipulations needed in the
Earth Science applications currently being implemented at the Data
Assimilation Office (DAO) of the National Aeronautics and Space
Administration (NASA). Due to real-time constraints the DAO software must
run efficiently on parallel computers. Numerous grids, structured and
unstructured, are employed in the software.

The DAO has implemented the PILGRIM library to support multiple
grids and the various grid transformations between them, e.g.,
interpolations, rotations, prolongations and restrictions. It allows grids to be
distributed over an array of processing elements (PEs) and manipulated
with high parallel efficiency. The design of PILGRIM closely follows the
DAO’s requirements, but it can support other applications which
employ certain types of grids. New grid definitions can be written to support
still others. Results illustrate that PILGRIM can solve grid manipulation
problems efficiently on parallel platforms such as the Cray T3E.

1  Introduction 

The need to discretize continuous models in order to solve scientific problems
gives rise to finite grids: sets of points at which prognostic variables are sought.
So prevalent is the use of grids in science that it is possible to forget that a
computer-calculated solution is not the solution to the original problem but
rather of a discretized representation of the original problem, and moreover is
only an approximate solution, due to finite precision arithmetic. Grids are
ubiquitous where analytical solutions to continuous problems are not obtainable, e.g.,
the solution of many differential equations.

Classically a structured grid is chosen a priori for a given problem. If the
quality of the solution is not acceptable, then the grid is made finer, in order to
better approximate the continuous problem.

For some time the practicality of unstructured grids has also been recognized.
In such grids it is possible to cluster points in regions of the domain which require
higher resolution, while retaining coarse resolution in other parts of the domain.
Unstructured grids are often employed in device simulation [1], computational
fluid dynamics [2], and even in oceanographic models [3]. Although these grids
are more difficult to lay out than structured grids, much research has been
done in generating them automatically [4]. In addition, once the grid has been
generated, numerous methods and libraries are available to adaptively
refine the mesh [5] to provide a more precise solution.

Furthermore, the advantages of multiple grids of varying resolutions for a
given domain have been recognized. This is best known in the Multigrid
technique [6], in which low frequency error components of the discrete solution are
eliminated if values on a given grid are restricted to a coarser grid on which a
smoother is applied. But multiple grids also find application in other fields, such as
speeding up graph partitioning algorithms [7].

An additional level of complexity has arisen in the last few years: many
contemporary scientific problems must be decomposed over an array of processing
elements (or PEs) in order to obtain a solution in an expedient manner.
Depending on the parallelization technique, not only the work load but also the grid
itself may be distributed over the PEs, meaning that different parts of the data
reside in completely different memory areas of the parallel machine. This makes
the programming of such an application much more difficult for the developer.

The Goddard Earth Observing System (GEOS) Data Assimilation System
(DAS) software currently being developed at the Data Assimilation Office (DAO)
is no exception to the list of modern grid applications. GEOS DAS uses
observational data with systematic and random errors and incomplete global coverage
to estimate the complete, dynamic and constituent state of the global earth
system. The GEOS DAS consists of two main components, an atmospheric
General Circulation Model (GCM) [8] to predict the time evolution of the global
earth system and a Physical-space Statistical Analysis Scheme (PSAS) [9] to
periodically incorporate observational data.

At least three distinct grids are being employed in GEOS DAS: an
observation grid (an unstructured grid of points with which physical quantities measured
by instruments or satellites are associated), a structured geophysical grid of
points spanning the earth at uniform latitude and longitude locations where
prognostic quantities are determined, and a block-structured computational grid
which may be stretched in latitude and longitude. Each of these grids has a
different structure and number of constituent points, but there are numerous
interactions between them. Finally, the GEOS DAS application is targeted for
distributed memory architectures and employs a message-passing paradigm for
the communication between PEs.

In this document we describe the design of PILGRIM (Fig. 1), a parallel
library for grid manipulations, which fulfills the requirements of GEOS DAS. The
design of PILGRIM is object-oriented [10] in the sense that it is modular, data is
encapsulated in each design layer, operations can be overloaded, and different
instantiations of grids can coexist simultaneously. The library is realized in Fortran
90, which allows the necessary software engineering techniques such as modules
and derived data types, while keeping in line with other Fortran developments
at the DAO. The communication layer is implemented using MPI [11]; however,
the communication interfaces defined in PILGRIM’s primary layer could
conceivably be implemented with other message-passing libraries such as PVM [12],
or with other paradigms, e.g., Cray SHMEM [13] or the shared-memory
primitives which are available on shared-memory machines like the SGI Origin or
Sun Enterprise.

Fig. 1. PILGRIM assumes the existence of fundamental communication primitives such
as the Message-Passing Interface (MPI) and optimized Basic Linear Algebra
Subroutines (BLAS). PILGRIM’s first layer contains routines for communication as well as for
decomposing the domain and packing and unpacking sub-regions of the local domain.
Above this is a sparse linear algebra layer which performs basic sparse matrix
operations for grid transformations. Above PILGRIM, modules define and support different
grids. Currently only the grids needed in GEOS DAS are implemented, but further
modules could be designed to support yet other grids.

This document is structured in a bottom-up fashion. Reasonable design
assumptions are made in Sect. 2 in order to ease the implementation. The layer
for communication, decompositions, and buffer packaging is discussed in Sect. 3.
The sparse linear algebra layer is specified in Sect. 4. The plug-in grid modules
are defined in Sect. 5 to the degree necessary to meet the requirements of GEOS
DAS. In Sect. 6 some examples and prototype benchmarks are presented for the
interaction of all the components. Finally we summarize our work in Sect. 7.

2  Design  Assumptions 

A literature search was the first step taken in the PILGRIM design process in
order to find public domain libraries which might be sufficient for the DAO's
requirements [14]. Surprisingly, none of the common parallel libraries for the
solution of sparse matrix problems, e.g., PETSc [15], Aztec [16], PLUMP [17],
et al., was sufficient for our purposes. These libraries all try to make the parallel
implementation transparent to the application. In particular, the application is
not supposed to know how the data are actually distributed over the PEs.



This trend in libraries is not universally applicable for the simple reason
that if an application is to be parallelized, the developers generally have a good
idea of how the underlying data should be distributed and manipulated.
Experience has shown us that hiding complexity often leads to poor performance, and
the developer often resorts to workarounds to make the system perform in the
manner she or he envisions. If the developer of a parallel program is capable of
deciding on the proper data distribution and manipulation of local data, then
those decisions need to be supported.

In  order  to  minimize  the  scope  of  PILGRIM,  other  simplifying  assumptions 
were made about the way the library will be used.


1. The local portion of the distributed grid array is assumed to be a contiguous
section of memory. The local array can have any rank, but if the rank is
greater than one, the developer must ensure that no gaps are introduced
into the actual data representation, for example, by packing it into a 1-D
array if necessary.

2. Grid transformations are assumed to be sparse, i.e., each of the values on one
grid is determined from a linear combination of only a few values from the
other grid. The linear transformation corresponds to a sparse matrix with a
predictable number of non-zero entries per row. This assumption is realistic
for the localized interpolations used in GEOS DAS.

3. At a high level, the application can access data through global indices, i.e.,
the indices of the original undistributed problem. However, at the level where
most computation is performed, the application needs to work with local
indices (ranging from one to the total number of entries in the local contiguous
array). The information to perform global-to-local and local-to-global
mappings must be contained in the data structure defining the grid. However, it
is assumed that these mappings are seldom performed, e.g., at the beginning
of execution, and these mappings need not be efficient.

4. All decomposition-related information is replicated on all PEs.


These assumptions are significant. The first avoids the introduction of an
opaque type for data and allows the application to manipulate the local data as it
sees fit. The fact that the data are contained in a simple data structure generally
allows higher performance than an implementation which buries the data inside a
derived type. The second assumption ensures that the matrix transformations are
not memory limited. The third implies that most of the calculation is performed
on the data in a local fashion. In GEOS DAS it is fairly straightforward to run
in this mode; however, it might not be the case in other applications. The last
assumption ensures that every PE knows about the entire data decomposition.


3  Communication and Decomposition Utilities


In this layer communication routines are isolated, and basic functionality is
provided for defining and using data decompositions as well as for moving sections
of data arrays to and from buffers.

The operations on data decompositions are embedded in a Fortran 90 module
which also supplies a generic DecompType to describe a decomposition. Any
instance of DecompType is replicated on all PEs such that every PE has access to
information about the entire decomposition. The decomposition utilities consist
of the following:


DecompRegular1d         Create a 1-D blockwise data decomposition
DecompRegular2d         Create a 2-D block-block data decomposition
DecompRegular3d         Create a 3-D block-block-block data decomposition
DecompIrregular         Create an irregular data decomposition
DecompCopy              Create new decomposition with contents of another
DecompPermute           Permute PE assignment in a given decomposition
DecompFree              Free a decomposition and the related memory
DecompGlobalToLocal1d   Map global 1-D index to local (pe, index)
DecompGlobalToLocal2d   Map global 2-D index to local (pe, index)
DecompLocalToGlobal1d   Map local (pe, index) to global 1-D index
DecompLocalToGlobal2d   Map local (pe, index) to global 2-D index

Using  the  Fortran  90  overloading  feature,  the  routines  which  create  new 
decompositions  are  denoted  by  DecompCreate.  Similarly,  the  1-D  and  2-D  global- 
to-local  and  local-to-global  mappings  are  denoted  by  DecompGlobalToLocal  and 
DecompLocalToGlobal, resulting in a total of five fundamental operations.
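To make the global-to-local and local-to-global mappings concrete, the following is a minimal sketch in Python rather than Fortran 90. The function names are hypothetical analogues and are not the PILGRIM interfaces; the block sizes, the 1-based indexing, and the linear search are assumptions chosen for clarity, in keeping with the assumption that these mappings need not be efficient.

```python
# Hypothetical Python analogue of a 1-D blockwise decomposition; the names
# below are illustrative and are not the PILGRIM Fortran 90 interfaces.

def decomp_regular_1d(n_global, n_pes):
    """Return the block size owned by each PE (sizes differ by at most one)."""
    base, extra = divmod(n_global, n_pes)
    return [base + 1 if pe < extra else base for pe in range(n_pes)]


def global_to_local(sizes, g):
    """Map a 1-based global index to (pe, 1-based local index)."""
    if not 1 <= g <= sum(sizes):
        raise IndexError("global index out of range")
    offset = 0
    for pe, size in enumerate(sizes):
        if g <= offset + size:
            return pe, g - offset
        offset += size


def local_to_global(sizes, pe, local):
    """Map (pe, 1-based local index) back to the 1-based global index."""
    if not 1 <= local <= sizes[pe]:
        raise IndexError("local index out of range")
    return sum(sizes[:pe]) + local


# The same list of block sizes would be replicated on every PE.
sizes = decomp_regular_1d(10, 4)
pe, local = global_to_local(sizes, 7)
```

Because the list of block sizes is small, replicating it on every PE costs little memory and lets any PE resolve any index without communication.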

Communication primitives are confined to this layer because it may be
necessary at some point to implement them with a message-passing library other
than MPI, such as PVM or SHMEM, or even with shared-memory primitives
such as those on the SGI Origin (the principal platform at the DAO). Thus it is
wise to encapsulate all message-passing into one Fortran 90 module. For brevity,
only the overloaded functionality is presented:


ParInit             Initialize the parallel code segment
ParExit             Exit from the parallel code segment
ParSplit            Split parallel code segment into two groups
ParMerge            Merge two code segments
ParScatter          Scatter global array to given data decomposition
ParGather           Gather from data decomposition to global array
ParBeginTransfer    Begin asynchronous data transfer
ParEndTransfer      End asynchronous data transfer
ParExchangeVector   Transpose block-distributed vector over all PEs
ParRedistribute     Redistribute one data decomposition to another

In order to perform calculations locally on a given PE it is often necessary
to ghost adjacent regions, that is, send boundary regions of the local domain
to adjacent PEs. To this end a module has been constructed to move ghost
regions to and from buffers. The buffers can be transferred to other PEs with
the communication primitives such as ParBeginTransfer and ParEndTransfer.
Currently the buffer module contains the following non-overloaded functionality:

BufferPackGhost2dReal     Pack a 2-D array sub-region into buffer
BufferUnpackGhost2dReal   Unpack buffer into 2-D array sub-region
BufferPackGhost3dReal     Pack a 3-D array sub-region into buffer
BufferUnpackGhost3dReal   Unpack buffer into 3-D array sub-region
BufferPackSparseReal      Pack specified entries of vector into buffer
BufferUnpackSparseReal    Unpack buffer into specified entries of vector


In this module, as in most others, the local coordinate indices are used instead
of  global  indices.  Clearly  this  puts  responsibility  on  the  developer  to  keep  track 
of  the  indices  which  correspond  to  the  ghost  regions.  In  GEOS  DAS  this  turns 
out  to  be  fairly  straightforward. 
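The packing and unpacking of a ghost region can be illustrated with a short sketch. It is written in Python, the function names and argument conventions (0-based, half-open local index ranges) are assumptions made for illustration, and the actual transfer between PEs is left out; only the copy between a local array and a flat buffer is shown.

```python
# Illustrative sketch only: hypothetical helpers, not the PILGRIM routines.

def pack_ghost_2d(array, i_range, j_range):
    """Copy array[i][j] for local indices in the given ranges into a flat buffer."""
    return [array[i][j] for i in range(*i_range) for j in range(*j_range)]


def unpack_ghost_2d(buffer, array, i_range, j_range):
    """Copy a flat buffer into the given sub-region, in the same traversal order."""
    values = iter(buffer)
    for i in range(*i_range):
        for j in range(*j_range):
            array[i][j] = next(values)


# Sender: a 3 x 4 local array whose last column borders the neighbouring PE.
sender = [[10 * i + j for j in range(4)] for i in range(3)]
buffer = pack_ghost_2d(sender, (0, 3), (3, 4))

# Receiver: a 3 x 5 local array whose column 0 is reserved as the ghost region.
receiver = [[0] * 5 for _ in range(3)]
unpack_ghost_2d(buffer, receiver, (0, 3), (0, 1))
```

Both sides refer to the region by local indices only, so each developer-supplied range must describe the same number of points on the sending and receiving PE.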


4  Sparse  Linear  Algebra 


The concept of transforming one grid to another involves interpolating the
values defined on one grid at grid-points on another. These values are stored as
contiguous vectors with a given index range 1 ... N and a distribution defined by
the grid decomposition (although the vector might actually represent a
multi-dimensional array at a higher level). Thus the sparse linear algebra layer
fundamentally consists of a facility to perform linear transformations on distributed
vectors.

As in other parallel sparse linear algebra packages, e.g., PETSc [15] and
Aztec [16], the linear transformation is stored in a distributed sparse matrix
format. Unlike those libraries, however, local indices are used when referring to
individual matrix entries, although the mapping DecompGlobalToLocal can be
used to translate from global to local indices. In addition, the application of the
linear transformation is a matrix-vector multiplication where the matrix is not
necessarily square, and the resulting vector may be distributed differently from
the original.

There are many approaches to storing distributed sparse matrices and
performing the matrix-vector product. PILGRIM uses a format similar to that
described in [17], which is optimal if the number of non-zero entries per row is
constant.

Assumption 3 in Sect. 2 implies that the matrix definition is not
time-consuming. In GEOS DAS the template of any given interpolation is initialized
once, but the interpolation itself is performed repeatedly. Thus relatively little
attention has been paid to the optimization of the matrix creation and definition.
The basic operations for creating and storing matrix entries are:


                           Destroy a sparse matrix
SparseInsertEntries        Insert entries replicated on all PEs
SparseInsertLocalEntries   Insert entries of local PE


Two scenarios for inserting entries are supported. In the first scenario, every
PE inserts all matrix entries. Thus every argument of the corresponding routine,
SparseInsertEntries, is replicated. The local PE picks up only the data which
it needs, leaving other data to the appropriate PEs. This scenario is the easiest
to program if the sequential code version is used as the code base.

In the second scenario the domain is partitioned over the PEs, meaning that
each PE is responsible for a disjoint subset of the matrix entries, and the matrix
generation is performed in parallel. Clearly this is the more efficient scenario. The
corresponding routine, SparseInsertLocalEntries, assumes that no two PEs
try to add the same matrix entry. However, it does not assume that all matrix
entries reside on the local PE, and it will perform the necessary communication
to put the matrix entries in their correct locations.
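The difference between the two insertion scenarios can be sketched as follows. This is a serial Python simulation under stated assumptions: the PEs are represented by a list of dictionaries, rows are assigned to PEs in uniform blocks, and all function names are hypothetical stand-ins rather than the library's routines. Moving an entry to another PE's dictionary stands in for a message.

```python
# Serial simulation of the two insertion scenarios; names are hypothetical.

def owner_of_row(row, rows_per_pe):
    """Assumed ownership rule: uniform blocks of 0-based rows."""
    return row // rows_per_pe


def insert_entries_replicated(my_pe, rows_per_pe, entries, local_matrix):
    """Scenario 1: every PE sees all entries and keeps only the rows it owns."""
    for row, col, val in entries:
        if owner_of_row(row, rows_per_pe) == my_pe:
            local_matrix[(row, col)] = val


def insert_local_entries(entries_by_pe, rows_per_pe, n_pes):
    """Scenario 2: each PE supplies a disjoint subset; entries go to their owner."""
    matrices = [dict() for _ in range(n_pes)]
    for src_pe, entries in enumerate(entries_by_pe):
        for row, col, val in entries:
            dest = owner_of_row(row, rows_per_pe)
            # When dest != src_pe this assignment stands in for communication.
            matrices[dest][(row, col)] = val
    return matrices


all_entries = [(0, 0, 1.0), (1, 2, 0.5), (2, 1, 0.25), (3, 3, 2.0)]

replicated = [dict(), dict()]
for pe in range(2):
    insert_entries_replicated(pe, 2, all_entries, replicated[pe])

# The same entries, generated on arbitrary PEs without duplication.
generated = [[(3, 3, 2.0), (0, 0, 1.0)], [(1, 2, 0.5), (2, 1, 0.25)]]
distributed = insert_local_entries(generated, 2, 2)
```

Both scenarios end with the same distributed matrix; the second avoids having every PE generate and scan the full list of entries.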

The  efficient  application  of  the  matrix  to  a  vector  or  group  of  vectors  is 
crucial  to  the  overall  performance  of  GEOS  DAS,  since  the  linear  transformations 
are  performed  continually  on  assimilation  runs  for  days  or  weeks  at  a  time. 
The  most  common  transformation  is  between  three-dimensional  arrays  of  two 
different  grids  which  describe  global  atmospheric  quantities  such  as  wind  velocity 
or temperature. One 3-D array might correspond to the geophysical grid which
covers the globe, while another might be the computational grid which is more
appropriate for the dynamical calculation. The explicit description of such a 3-D
transformation  might  be  prohibitive  in  terms  of  memory.  But  fortunately,  this 
transformation  only  has  dependencies  in  two  of  the  three  dimensions  as  it  acts 
on  2-D  horizontal  cross-sections  independently. 

To fulfill the assumptions in Sect. 2, a 2-D array is considered a vector x.
Using this representation the transformations become parallel matrix-vector
multiplications, which can be performed with one of the following two operations:


SparseMatVecMult        Perform $y \leftarrow \alpha A x + \beta y$
SparseMatTransVecMult   Perform $y \leftarrow \alpha A^T x + \beta y$
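As an illustration of the first operation, here is a serial Python sketch of $y \leftarrow \alpha A x + \beta y$ for a matrix stored row by row with a fixed number of non-zeros per row. The storage layout (parallel lists of column indices and values) and the function name are assumptions for this sketch; the distribution over PEs and the communication of non-local entries of x are omitted.

```python
# Serial sketch with an assumed fixed-width row storage; not the library code.

def sparse_mat_vec_mult(cols, vals, alpha, x, beta, y):
    """Update y in place: y <- alpha * A * x + beta * y.

    cols[i] and vals[i] hold the column indices and values of the non-zero
    entries in row i; the matrix need not be square.
    """
    for i in range(len(y)):
        acc = 0.0
        for j, a in zip(cols[i], vals[i]):
            acc += a * x[j]
        y[i] = alpha * acc + beta * y[i]
    return y


# A 2 x 3 interpolation-like matrix with two non-zeros per row.
cols = [[0, 1], [1, 2]]
vals = [[0.5, 0.5], [0.25, 0.75]]
x = [2.0, 4.0, 8.0]
y = sparse_mat_vec_mult(cols, vals, 2.0, x, 1.0, [1.0, 1.0])
```

With a constant number of non-zeros per row, the inner loop has a fixed trip count, which is one reason such a format suits localized interpolations.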

In order to transform several arrays simultaneously, the arrays are grouped
into multiple vectors, that is, into an n x m matrix where n is the length of the
vector (number of values in the 2-D array), and m is the number of vectors. The
following matrix-matrix and matrix-transpose-matrix multiplications can group
messages in such a way as to drastically reduce latencies and utilize BLAS-2
operations instead of BLAS-1:


SparseMatMatMult        Perform $Y \leftarrow \alpha A X + \beta Y$
SparseMatTransMatMult   Perform $Y \leftarrow \alpha A^T X + \beta Y$

The distributed representation of the matrix contains, in addition to the
matrix information itself, space for the communication pattern. Upon entering any
one of the four matrix operations, the matrix is checked for new entries which
may have been added since its last application. If the matrix has been modified,
the operation first generates the communication pattern (an optimal map of
the information which has to be exchanged between PEs) before performing
the matrix multiplication. This is a fairly expensive operation, but in GEOS DAS
it only needs to be done once when the matrix is first defined. Subsequently, the
matrix can be applied repetitively in the most efficient manner.
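The idea of regenerating the communication pattern only when the matrix has changed can be sketched as a cached value that is invalidated on insertion. The class below is a hypothetical Python illustration: the pattern is simply the sorted list of non-local vector entries needed from each remote PE, which is an assumption and not necessarily the map the library builds.

```python
# Hypothetical illustration of a lazily rebuilt communication pattern.

class DistributedSparseMatrix:
    def __init__(self, my_pe, owner_of_col):
        self.my_pe = my_pe
        self.owner_of_col = owner_of_col   # function: column index -> owning PE
        self.entries = {}
        self.pattern = None                # None means "needs to be rebuilt"
        self.pattern_builds = 0

    def insert(self, row, col, val):
        self.entries[(row, col)] = val
        self.pattern = None                # any new entry invalidates the pattern

    def communication_pattern(self):
        """Return {remote_pe: sorted column indices needed}, rebuilding if stale."""
        if self.pattern is None:
            needed = {}
            for (_, col) in self.entries:
                pe = self.owner_of_col(col)
                if pe != self.my_pe:
                    needed.setdefault(pe, set()).add(col)
            self.pattern = {pe: sorted(c) for pe, c in needed.items()}
            self.pattern_builds += 1
        return self.pattern


matrix = DistributedSparseMatrix(my_pe=0, owner_of_col=lambda col: col // 4)
for row, col in [(0, 1), (0, 5), (1, 6), (1, 5)]:
    matrix.insert(row, col, 1.0)

first = matrix.communication_pattern()
second = matrix.communication_pattern()    # reuses the cached pattern
builds_before_change = matrix.pattern_builds

matrix.insert(2, 9, 1.0)                   # modification forces a rebuild
third = matrix.communication_pattern()
```

When the matrix is defined once and applied many times, the cost of building the pattern is paid a single time.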

5  Supported  Grids 

The grid data structure describes a set of grid-points and their decomposition
over the array of PEs, as well as other information, such as the size of the domain.
The grid data structure itself does not contain actual data and can be replicated
on all PEs due to its minimal memory requirements. The data reside in arrays
which are distributed as described in the grid data
structure. There is no limitation on how the application accesses and manipulates
the data. The grids employed in GEOS DAS are described
here, but others are conceivable and could be supported by PILGRIM without
modifications to the library.

The latitude-longitude grid defines a lat-lon coordinate system: a regular
grid in which all points in a row have a given latitude and
all points in a column a given longitude. The grid encompasses the entire earth
from $-\pi$ to $\pi$ longitudinally and from $-\pi/2$ to $\pi/2$ in latitude.

Fig. 2. GEOS DAS uses a column decomposition of data (left), also termed a
"checkerboard" decomposition (right). The width and breadth of a column can be
variable, although generally an approximately equal number of points are
assigned to every PE.


The decomposition of this grid is a "checkerboard" (Fig. 2) because each PE
holds all levels of its columns, so that the decomposition is determined by
the 2-D decomposition of the horizontal cross-section. This decomposition
need not assign exactly the same number of points to each PE, which leaves
some freedom for load balancing.
   TYPE LatLonGridType
      TYPE (DecompType) :: Decomp         ! Decomposition
      INTEGER           :: ImGlobal       ! Global Size in X
      INTEGER           :: JmGlobal       ! Global Size in Y
      REAL              :: Tilt           ! Tilt of remapped NP
      REAL              :: Rotation       ! Rotation of remapped NP
      REAL              :: Precession     ! Precession of remapped NP
      REAL, POINTER     :: dLat(:)        ! Latitudes
      REAL, POINTER     :: dLon(:)        ! Longitudes
   END TYPE LatLonGridType

This grid suffices to describe both the GEOS DAS computational grid used
for dynamical calculations and the geophysical grid in which the prognostic
variables are sought. The former makes use of the parameters Tilt, Rotation
and Precession to describe its view of the earth (Fig. 3), and the dLat and
dLon grid box sizes to describe the grid stretching. The latter is defined by the
normal geophysical values (Tilt, Rotation, Precession) = ($\pi/2$, 0, 0) and
uniform dLat and dLon.

The observation grid data structure describes observation points over the
globe, as described by their lat-lon coordinates. In contrast to the lat-lon grid, the
point grid decomposition is inherently one-dimensional since there is no structure
to the grid.

   TYPE ObsGridType
      TYPE (DecompType) :: Decomp         ! Decomposition
      INTEGER           :: Nobservations  ! Total points
   END TYPE ObsGridType

The data corresponding to this grid data structure is a set of vectors, one
for the observation values and several for attributes of those values, such as the
latitude, longitude and level at which an observation was taken.


6  Results 

An example of a non-trivial transformation employed in atmospheric science
applications is grid rotation [18]. Computational instabilities from finite difference
schemes can arise in the polar regions of the geophysical grid when a strong
cross-polar flow occurs. By placing the pole of the computational grid at the
geographic equator, however, the instability near the geographic pole is removed
due to the vanishing Coriolis term.
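The geometric idea of moving the pole to the equator can be illustrated with a small coordinate rotation. The sketch below, in Python, rotates a point on the unit sphere about a single axis by a single angle; this is a simplification assumed for illustration and does not reproduce the three-parameter description of a rotated grid view used by the library. The function name and the choice of the y-axis are hypothetical.

```python
import math

HALF_PI = math.pi / 2


def rotate_pole(lat, lon, tilt):
    """Rotate the point (lat, lon), in radians, about the y-axis by `tilt`."""
    # Convert to Cartesian coordinates on the unit sphere.
    x = math.cos(lat) * math.cos(lon)
    y = math.cos(lat) * math.sin(lon)
    z = math.sin(lat)
    # Rotate about the y-axis.
    xr = math.cos(tilt) * x + math.sin(tilt) * z
    zr = -math.sin(tilt) * x + math.cos(tilt) * z
    # Guard against round-off pushing zr slightly outside [-1, 1].
    zr = max(-1.0, min(1.0, zr))
    return math.asin(zr), math.atan2(y, xr)


# The geographic north pole lands on the equator of the rotated system.
pole_lat, pole_lon = rotate_pole(HALF_PI, 0.0, HALF_PI)

# An equatorial point becomes a "pole" of the rotated system.
new_pole_lat, _ = rotate_pole(0.0, math.pi, HALF_PI)

# Rotating back with the opposite angle recovers the original point.
round_trip = rotate_pole(*rotate_pole(0.3, 1.1, 0.7), -0.7)
```

In the rotated coordinates the geographic pole is an ordinary equatorial point, so the grid lines no longer converge there.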

It is generally accepted that the physical processes such as those related to
long- and short-wave radiation can be calculated directly on the geophysical grid.
Dynamics, where the numerical instability occurs, needs to be calculated on the
computational grid. An additional refinement involves calculating the dynamics
on a rotated stretched grid, in which the grid-points are not uniform in latitude
and longitude. The LatLonGridType allows for both variable lat-lon coordinates
as well as the description of any lat-lon view of the world where the poles are
assigned to a new geographical location. The grid rotation (without stretching)
is depicted in Fig. 3.


Fig. 3. The use of the latitude-longitude grid (a) and (c) as the computational grid
results in instabilities at the poles due to the Coriolis term. The instabilities vanish
on a grid (b) where the pole has been rotated to the equator. The computational
grid is therefore a lat-lon grid (d) where the “poles” on the top and bottom are in the
Pacific and Atlantic Oceans, respectively.


It would be natural to use the same decomposition for both the geophysical
and computational grids. It turns out, however, that this approach disturbs data
locality inherent to this transformation (Fig. 4). If the application could have
unlimited freedom to choose the decomposition of the computational grid, the
forward and reverse grid rotations could exhibit excellent data locality, and the
matrix application would be much more efficient.¹ Unfortunately, practicality
limits the decomposition of both the geophysical and computational grids to be
a checkerboard decomposition.

However, there are still several degrees of freedom in the decomposition,
namely the number of points on each PE and the assignment of local regions to
PEs. While an approximately uniform number of points per PE is generally best
for the dynamics calculation, the assignment of PEs is arbitrary. The following
optimization is therefore applied: the potential communication pattern of a naive

¹ A simply connected region in one domain will map to at most two simply connected
regions in the other.

[Figure 4 panels: Unpermuted Communication Matrix (left), Permuted Communication Matrix (right)]

Fig. 4. The above matrices represent the number of vector entries requested by a
PE (column index) from another PE (row index) to perform a grid rotation for one
72 x 48 horizontal plane (i.e., one matrix-vector multiplication) on a total of eight
PEs. The unpermuted communication matrix reflects the naive use of the geophysical
grid decomposition and PE assignment for the computational grid. The permuted
communication matrix uses the same decomposition, except the assignment of local
regions to PEs is permuted. The diagonal entries denote data local to the PE and
represent work which can be overlapped with the asynchronous communication involved
in fetching the non-local data. The diagonal dominance of the communication matrix
on the right translates into a considerable performance improvement.


computational grid decomposition is analyzed by adopting the decomposition of
the geophysical grid. With a heuristic method, this analysis leads to a
permutation of PEs for the computational grid which reduces communication (Fig. 4).
The decomposition of the computational grid is then defined as a permuted
version of the geophysical grid. Only then is the grid rotation matrix defined. An
outline of the code is given in Algorithm 1.

Algorithm 1 (Optimized Grid Rotation) Given the geophysical grid
decomposition, find a permutation of the PEs which will maximize the data locality
of the geophysical-to-computational grid transformation, create and permute the
computational grid decomposition, and define the transformation in both
directions.

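   ! Create the (still empty) transformation matrices
   SparseMatrixCreate( ..., GeoToComp )
   SparseMatrixCreate( ..., CompToGeo )
   ! Define the geophysical decomposition and grid
   DecompCreate( GeoPhysDecomp )
   LatLonCreate( GeoPhysDecomp, GeoPhysGrid )
   ! Find the PE permutation which maximizes data locality
   AnalyzeGridTransform( GeoPhysDecomp, ..., Permutation )
   ! The computational decomposition is a permuted copy
   DecompCopy( GeoPhysDecomp, CompDecomp )
   DecompPermute( Permutation, CompDecomp )
   LatLonCreate( CompDecomp, CompGrid )
   ! Define the transformation in both directions
   GridTransform( GeoPhysGrid, CompGrid, GeoToComp )
   GridTransform( CompGrid, GeoPhysGrid, CompToGeo )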
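The text does not specify which heuristic is used to find the permutation. As one possible illustration, the Python sketch below applies a simple greedy rule: repeatedly take the largest remaining entry of the communication matrix and assign the requesting region (column) to the PE that holds the data (row). The greedy rule, the function names, and the 4 x 4 matrix are all hypothetical and chosen only to show how a permutation can make the matrix diagonally dominant.

```python
# Hypothetical greedy heuristic; the heuristic actually used is not specified.

def greedy_pe_permutation(comm):
    """Return perm, where region (column) c is reassigned to PE perm[c]."""
    n = len(comm)
    candidates = sorted(
        ((comm[r][c], r, c) for r in range(n) for c in range(n)), reverse=True
    )
    perm = [None] * n
    used_pes = set()
    for _volume, r, c in candidates:
        if perm[c] is None and r not in used_pes:
            perm[c] = r
            used_pes.add(r)
    return perm


def permute_columns(comm, perm):
    """Relabel the requesting PEs (columns) according to perm."""
    n = len(comm)
    out = [[0] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            out[r][perm[c]] = comm[r][c]
    return out


def local_volume(comm):
    """Sum of the diagonal: entries which need no communication."""
    return sum(comm[i][i] for i in range(len(comm)))


# Made-up communication matrix: rows hold the data, columns request it.
comm = [
    [10, 900, 20, 5],
    [800, 15, 30, 10],
    [25, 10, 12, 700],
    [5, 20, 600, 30],
]
perm = greedy_pe_permutation(comm)
permuted = permute_columns(comm, perm)
```

A greedy choice is not guaranteed to maximize the diagonal sum in general; an exact answer would require solving an assignment problem, which may or may not be worthwhile for a one-time setup step.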

In GridTransform the coordinates of one lat-lon grid are mapped to another.
Interpolation coefficients are determined by the proximity of rotated grid-points
to grid-points on the other grid (Fig. 3). Various interpolation schemes can be
employed, including bi-linear or bi-cubic; the latter is employed in GEOS DAS.
The transformation matrix can be completely defined by the two grids; the
values on those grids are not necessary.
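To show how interpolation coefficients depend only on grid geometry, the following Python sketch computes one row of a bi-linear interpolation matrix. It assumes a uniform grid addressed in fractional index coordinates, row-major flattening, and no periodic wrap-around in longitude; these simplifications and the function name are assumptions for illustration, and the bi-cubic scheme mentioned in the text is not shown.

```python
import math


def bilinear_row(x, y, nx, ny):
    """Return [(flat_index, weight)] for a target point at index coords (x, y).

    The source grid has nx points in x and ny points in y, with 0 <= x <= nx - 1
    and 0 <= y <= ny - 1. Source values are assumed flattened as j * nx + i.
    """
    i0 = min(int(math.floor(x)), nx - 2)
    j0 = min(int(math.floor(y)), ny - 2)
    tx, ty = x - i0, y - j0
    return [
        (j0 * nx + i0, (1 - tx) * (1 - ty)),
        (j0 * nx + i0 + 1, tx * (1 - ty)),
        ((j0 + 1) * nx + i0, (1 - tx) * ty),
        ((j0 + 1) * nx + i0 + 1, tx * ty),
    ]


# One matrix row: four non-zero entries whose weights sum to one.
row = bilinear_row(1.25, 2.5, 4, 4)

# Applying the row to grid values f(i, j) = 2 i + 3 j reproduces f at the target.
values = [2 * i + 3 * j for j in range(4) for i in range(4)]
interpolated = sum(w * values[k] for k, w in row)
```

Every row has exactly four non-zero entries, which matches the assumption of a predictable number of non-zeros per row.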

Once the transformation matrix is defined, sets of grid values, such as
individual levels or planes of atmospheric data, can be transformed ad infinitum
using a matrix-vector multiplication.

   DO L = 1, GLOBAL_Z
      CALL SparseMatVecMult( GeoToComp, 1.0, In(1,1,L), 0.0, Out1(1,1,L) )
   END DO

Alternatively, the transformation of the entire 3-D data set can be
performed with one matrix-matrix product:

   CALL SparseMatMatMult( GeoToComp, GLOBAL_Z, 1.0, In, 0.0, Out2 )
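The equivalence of the level-by-level loop and the single multi-vector product can be checked with a small serial sketch in Python. The row-wise storage, the function names, and the data are assumptions for illustration; in the multi-vector version each matrix entry is read once and applied to all levels, which is the property that allows messages and arithmetic to be grouped.

```python
# Serial sketch; hypothetical names and storage, no distribution over PEs.

def mat_vec(cols, vals, x):
    """Return A x for one vector x (one level)."""
    return [sum(a * x[j] for j, a in zip(c, v)) for c, v in zip(cols, vals)]


def mat_mat(cols, vals, X):
    """Return A X, where X[j] holds the values of point j on all m levels."""
    m = len(X[0])
    Y = []
    for c, v in zip(cols, vals):
        row = [0.0] * m
        for j, a in zip(c, v):
            # One matrix entry is applied to every level at once.
            for k in range(m):
                row[k] += a * X[j][k]
        Y.append(row)
    return Y


cols = [[0, 1], [1, 2]]
vals = [[0.5, 0.5], [0.25, 0.75]]
X = [[2.0, 1.0], [4.0, 3.0], [8.0, 5.0]]     # 3 points, 2 levels

# Level-by-level loop.
by_level = [mat_vec(cols, vals, [X[j][k] for j in range(3)]) for k in range(2)]

# Single multi-vector product.
all_levels = mat_mat(cols, vals, X)
```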

Note that the pole rotation is trivial (embarrassingly parallel) if any given
plane resides entirely on one PE, i.e., if the 3-D array is decomposed in the
z-dimension. Unfortunately, there are compelling reasons to distribute the data in
vertical columns with the checkerboard decomposition.

Fig. 5 compares the performance of the unpermuted rotation with that of
the permuted rotation on the Cray T3E. A further optimization is performed by
replacing the non-blocking MPI primitives used in ParBeginTransfer by faster
Cray SHMEM primitives. The result of these optimizations is the improvement
in scalability from tens of PEs to hundreds of PEs. The absolute performance in
GFlop/s is presented in Fig. 6.


[Fig. 5 plots. Left panel: "MPI Pole Rotation: Performance on Cray T3E". Right
panel: "Optimized MPI-SHMEM Pole Rotation: Performance on Cray T3E". Horizontal
axis of both panels: Cray T3E Processors (300 MHz).]

Fig. 5. With a naive decomposition of both the geophysical and computational grids
and a straightforward MPI implementation, the performances at the left for the 72 x
46 x 70 (*), 144 x 91 x 70 (x), and 288 x 181 x 70 (o) resolutions yield good scalability
only to 10-50 processors. The optimized MPI-SHMEM hybrid version on the right
scales to nearly the entire extent of the machine (512 processors).

[Fig. 6 plot: "MPI-SHMEM Hybrid: Performance of the rotation of one field on T3E".
Horizontal axis: Number of T3E (300 MHz) Processing Elements.]

Fig. 6. The GFlop/s performances of the grid rotation on grids with 144 x 91 x 70 (o)
and 288 x 181 x 70 (x) resolutions are depicted. These results are an indication that the
grid rotation will not represent a bottleneck for the overall GEOS DAS system.


7  Summary 

We have introduced the parallel grid manipulations needed by GEOS DAS and
the PILGRIM library to support them. PILGRIM is modular and extensible,
allowing us to support various types of grid manipulations. Results from the
grid rotation problem were presented, indicating scalable performance on
state-of-the-art parallel computers with a large number (> 100) of processors.

We are hoping to extend the usage of PILGRIM in GEOS DAS to the interface
between the forecast model and the statistical analysis, to perform further
optimizations on the library, and to offer the library to the public domain.

Molecular Dynamics as a Natural Solver

Abstract. The universal character of the molecular dynamics (MD) method is discussed.
In contrast to the classical use of MD in investigations of the microscopic
world, MD simulation of mesoscopic phenomena is considered. Sample
results of MD simulations of the Rayleigh-Taylor instability are shown and
discussed briefly. To cover larger time and space scales, either a simplified MD
model or more sophisticated particle-based algorithms can be used. In the first case,
the MD method can be applied directly as a predictive display in computer animation.
In the second, MD code can be the “backbone” of an efficient computer realization of
particle-based methods such as dissipative particle dynamics and smoothed particle
hydrodynamics. Applications of the MD approach to global optimization problems are
also discussed. It is emphasized that the inherent parallelism of the MD method, which
results in efficient realization on MPP systems, together with its universal properties,
makes the method a powerful natural solver.

1  Introduction 

According to physics, particles interact with one another through the exchange of virtual
objects, e.g., photons in electromagnetics. Changes in the physical states of particles, i.e.,
their positions, momenta, spins etc., result from their interactions. This atomistic
approach reflects an important principle of nature and human logic, i.e., the construction
of complex models from simple elements and rules via their mutual "interactions", or,
in other terms, information exchange.

The virtual particle (VIP) [1,2] is the base element of the particle-based computational
model. A VIP can be defined on different levels of abstraction [2], e.g. as an atom, particle,
cluster of particles, vehicle-target-obstacle, genotype, multidimensional point, UNIX
process, single processor, etc. For example, taking into account that UNIX processes
can “interact” by sending and receiving messages, we can think about a direct
transformation of the VIP model into the message-passing model of parallel
computations. This involves changing the VIP level of abstraction from
particles to processes exchanging messages. It is relatively easy, due to the flexibility
of the VIP model and its self-consistency.

The main suggestion put forward in [1,2] is a new strategy for the parallel
realization of an application using two stages of mapping (see Fig.1). First, a
problem is transformed into one of the natural solvers (or their hybrid) and virtual
particles are defined. Then the method is realized on a multicomputer system through
the mapping of virtual particles onto a virtual parallel machine model [1]. Several
widely used natural solvers, such as the Boltzmann lattice gas, lattice gas, simulated
annealing, direct Monte-Carlo, cellular automata, genetic algorithms and neural
networks, and others with a more limited scope of use, such as diffusion limited
aggregation (DLA) and percolation, can be treated as particle-based techniques in
accordance with the definition presented in [2]. All these techniques have been used
in physics, chemistry and biology for many years. Therefore, the second stage of
mapping (i.e., implementation on a multiprocessor architecture) often allows us to
exploit ready-to-use parallel algorithms, or at least existing knowledge about ways of
parallelizing particle-based methods. In the authors' opinion, successful mapping of a
problem onto a solver is crucial. This sort of mapping needs a creative and abstract
way of thinking that is impossible to mimic by current and future generations of
computer systems.


Fig.1. Problem mapping onto multiprocessor model through its transformation into a natural
solver [1].

The molecular dynamics (MD) method, a well known technique of computational physics
and one of the Grand Challenges of Science [3] problems, can be taken as a pure
particle paradigm. The goal of this paper is to show that MD can be treated as a
natural solver, i.e., a universal paradigm whose principles come from nature and
which can be used as a solver in various fields of science and engineering. MD and
other natural solvers, such as simulated annealing, genetic algorithms, neural networks
and cellular automata, constitute, due to their inherent parallelism, a class of
powerful computational tools when run on a parallel system. The increasing
interest in implementing these techniques on multiprocessor systems is
a natural consequence of this property.

At the beginning of the paper the mathematical background and computer
realization of the MD method are discussed briefly. Then sample results of MD
applications in large-scale computational experiments on the
Rayleigh-Taylor instability are presented. In the following section it is shown that a
simplified computer realization of the MD method can be used as an efficient
animation technique based on the principal physical laws. Since the visual impression
of movement plays the principal role in animation, physical details can be hidden from
the observer and thus substantially simplified. Other advantages of MD applications
for computer animation are also discussed. The role of simulation using particles as a
new technique of global minimum search is introduced. The visual clustering problem
is considered as an example. Based on the results, conclusions are formulated at the
end of the paper.

2  MD  principles 

Molecular dynamics is a computational technique that has been widely used in physics,
chemistry and biology for almost 35 years (e.g. [4]). Its basic principles are shown in Fig.2.

Each particle i interacts with all others located within a sphere of radius $R_{cut}$,
according to the potential energy of interaction. In the simplest case, the two-body radial
pair potential $\phi(r_{ij})$ depends only on the distance $r_{ij}$ between the particles. For more
complex molecules, the potential function can be more sophisticated. The pair force is
$\mathbf{f}_{ij} = -\nabla\phi(r_{ij})$, while the total force $\mathbf{F}_i$ acting on a single particle i is the sum of
the pair forces $\mathbf{f}_{ij}$ of its neighbour particles within the $R_{cut}$ sphere.
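As an illustration of this force summation, the sketch below assumes a Lennard-Jones pair potential, $\phi(r) = 4\varepsilon[(\sigma/r)^{12} - (\sigma/r)^{6}]$. The text does not fix the form of the potential at this point, so this choice, the function names and the parameters `eps`, `sigma` and `r_cut` are assumptions made only for the example.

```python
# Sketch only: pair forces f_ij = -grad(phi) and their sum over neighbours
# inside the cut-off sphere, for an assumed Lennard-Jones potential.

def lj_pair_force(ri, rj, eps=1.0, sigma=1.0):
    """Force exerted on particle i by particle j."""
    d = [a - b for a, b in zip(ri, rj)]          # vector from j to i
    r2 = sum(c * c for c in d)
    s6 = (sigma * sigma / r2) ** 3               # (sigma / r)^6
    scale = 24.0 * eps * (2.0 * s6 * s6 - s6) / r2
    return [scale * c for c in d]


def total_forces(positions, r_cut, eps=1.0, sigma=1.0):
    """Total force on every particle from neighbours closer than r_cut."""
    n = len(positions)
    forces = [[0.0] * len(positions[0]) for _ in range(n)]
    for i in range(n - 1):
        for j in range(i + 1, n):
            r2 = sum((a - b) ** 2 for a, b in zip(positions[i], positions[j]))
            if r2 > r_cut * r_cut:
                continue                          # outside the sphere
            f = lj_pair_force(positions[i], positions[j], eps, sigma)
            for k, fk in enumerate(f):
                forces[i][k] += fk                # action ...
                forces[j][k] -= fk                # ... and reaction
    return forces


# Two close particles and one far beyond the cut-off radius.
example = total_forces([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]], r_cut=2.5)
```

The double loop is the simplest possible neighbour search; production codes normally replace it with cell or neighbour lists, which the sketch does not attempt.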


Fig.2.  Basic  principles  of  MD  paradigm. 


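Time evolution of the particles is defined by the Newtonian equations of motion:

$$ m_i \frac{d\mathbf{v}_i}{dt} = \sum_{j \in S(i, R_{cut})} \mathbf{f}_{ij}, \qquad \frac{d\mathbf{r}_i}{dt} = \mathbf{v}_i \qquad (1) $$

where $\mathbf{v}_i$ and $\mathbf{r}_i$ represent the velocity and coordinates of particle i, respectively,
and $S(i, R_{cut})$ is the set of particles inside the sphere of radius $R_{cut}$ around particle i. The
computer implementation of MD techniques consists of the successive calculation of
forces and particle movements for each time step.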

A set of simulated particles is confined (in most cases) in a rectangular box
with periodic boundary conditions (PBC) imposed. This assumption is important for
obtaining valuable simulation results. The number of particles, M, is limited by the
computational power of computers ($M = 10^{9}$ on the fastest parallel system [5]). In the
real world, one mole of liquid contains about $6 \times 10^{23}$ molecules. PBC make it possible
to mimic an infinite medium using a limited number of molecules. However, this
assumption works well only for time scales limited by the size of the computational box
divided by the sound speed in the simulated medium. Because the box size depends on M,
larger samples of molecules should be used to obtain more accurate results for the
phenomena under investigation. Given that a molecule may consist of hundreds or
thousands of atoms (particles), and that its simulation is much slower than for a simple
molecule, in liquid argon for example, the evolution of a large number of particles over
longer and longer time scales becomes a great challenge for the fastest computer
systems ever constructed. Therefore, serious research has been going on for years
to implement MD codes on top-performance computer systems [6]. For
parallel implementation of the MD method, geometric decomposition is usually used.
Fig.3 shows a typical decomposition of the computational box for distributed
computations on a ring of workstations (Fig.3a) and for parallel processing on
tightly coupled MPP architectures (Fig.3b).

Fig.3. Two approaches for MD domain parallelism. The arrows show directions of information
exchange between a domain (shaded) and its neighborhood. For (a) the load balancing is
realized by changing the strip widths, while for (b) it is more fine grained though complicated.
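The two ingredients discussed above, periodic boundaries and a strip-wise geometric decomposition of the kind used on a ring of workstations, can be sketched in a few lines. The helper names `wrap`, `minimum_image` and `strip_owner` are hypothetical, a one-dimensional coordinate is used for brevity, and equal-width strips are assumed (no load balancing).

```python
# Sketch only: periodic boundary helpers and ownership of equal-width strips.

def wrap(x, box):
    """Map a coordinate back into the periodic box [0, box)."""
    return x % box


def minimum_image(dx, box):
    """Shortest signed separation between two particles under PBC."""
    return dx - box * round(dx / box)


def strip_owner(x, box, n_procs):
    """Rank of the processor whose strip contains coordinate x.

    The min() guards against rounding that would otherwise place a
    particle sitting numerically on the upper edge into strip n_procs.
    """
    return min(int(wrap(x, box) / box * n_procs), n_procs - 1)


# Particles at 0.5 and 9.5 in a box of length 10 are 1.0 apart, not 9.0.
separation = minimum_image(9.5 - 0.5, 10.0)
owner = strip_owner(-0.5, 10.0, 4)
```

Changing the strip boundaries instead of keeping them equidistant would give the simple form of load balancing mentioned for the strip decomposition.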

As is shown in [6], progress in hardware and software development has made it possible
to increase the number of atoms simulated using MD codes from hundreds in the late
seventies to billions in the mid-nineties. Parallel MD codes reach 95% efficiency on
hundreds of processors. A vast amount of literature and MD software for the full
spectrum of vector and multiprocessor architectures is available. From this point of
view, the MD method fulfills an important condition which a natural solver should
possess. However, the most relevant feature of natural solvers is their
universality.

3  Large-scale  MD  simulations  of  physical  phenomena 

The classical field of interest of MD simulations covers microscopic, short-time
phenomena in liquids and solids. By time and space averaging of stochastic
functions and variables one can obtain integral and/or differential parameters of the
medium investigated. By fitting simulation results to experimental and theoretical
values, one can find the proper model of molecules and/or potential energy of the
interacting particles. Moreover, it is possible to observe the reactions of separate
molecules and of the whole system to an external stimulus. Nevertheless, all these
phenomena occur in the abstract microscopic world, which seems to limit the
field of application of the MD approach.

The first MD experiments [7,8] in which not statistical fluctuations but rather the
collective movement of simple Lennard-Jones particle ensembles was investigated
show that even for a relatively small number of particles in short-time simulations it is
possible to observe a striking resemblance between patterns created in the microscopic
and macroscopic worlds. By increasing the number of particles to millions it is possible
to simulate phenomena in the mesoscale (i.e., where the size of samples is of the order
of 1 μm and the simulation time is tens of nanoseconds), e.g., fluid flows [8,9], crack
formation [10] and the creation of hydrodynamical instabilities [11,12]. Such
investigations are important where classical models based on continuous matter and
momentum equations (e.g. the Navier-Stokes equations in hydrodynamics) are
insufficient and the assumptions of continuity are no longer valid. The same applies to
the description of phenomena which originate in the microscale and develop in the
macroscale. To simulate them using classical continuous models, artificial fluctuations
are introduced. This results in the lack of any information about the initial stage of the
mixing process, its causality and start-up time.

The first results of simulations of the Rayleigh-Taylor instability using pure MD
parallel code are presented in [12]. The computer experiment consists in simulating the
mixing of two particle layers. The first layer consists of heavy particles and the second
one, placed below, is made of light particles. The gravitational field directed from
the heavy layer to the lighter one makes the system unstable. Due to statistical
fluctuations the two fluids begin to mix. This sort of instability is among the hardest
cases to simulate using classical hydrocodes. In particular, its initialization has not yet
been investigated in detail because the classical equations of fluid dynamics lack a
causality factor. As one can see in Fig.4, the evolution of the mixing process
obtained with the MD code is similar to that observed in experiment and to that obtained
from simulations which use classical hydrocodes. Unlike in hydrocode simulations,
however, the process is spontaneous, i.e., not initialized artificially. The
fluctuations represent the real causality factor lacking in the former models. Thanks to
this advantage it is possible to investigate more thoroughly the time evolution of the
mixing layer, not only for infinitely thick liquid layers but also for layers with a free
surface (see Fig.4). For example, as one can see in Fig.5, two mixing regimes can be
distinguished. The first one is observed at the beginning of the process, when only thin
boundary layers of the two liquids take part in mixing. When the sound wave caused by
switching on the acceleration field reflects from the bottom of the computational box, the
process changes in character and mixing gets faster.

The resemblance between simulation results for similar processes in the micro and
macroscales suggests that by rescaling and by changing the definition of a
particle and of the interparticle potential we can use the MD model for simulation of
physical phenomena in the macroscale [13].

Fig.4. Snapshots of the Rayleigh-Taylor instability simulation using a million particles
for 300,000 timesteps in an MD experiment. The colors show the particle density. The simulation
was performed using MD parallel code in the PVM environment on 48 processors of a Cray T3E system.


The advantages of the particle approach over computational methods which use finite
elements or finite differences are evident. The most important factors are as follows:

•  the lack of any grid,
•  simple and flexible computational model,
•  simple definition of discontinuities,
•  efficient parallel codes,
•  minor problems with complicated boundaries and inhomogeneities.


Fig.5.  The  growth  of  mixing  layer  for  two  different  simulations  (different  thickness  of  the 
heavy  layer  assumed). 

The problems with defining the interparticle potential can be overcome by using models
from the so-called dissipative particle dynamics method [14], or by deriving the potential
directly from the particle formulation of the Navier-Stokes equations using the smoothed
particle hydrodynamics method [15]. Another approach is used for investigations of
granular media (e.g. [16]), where the particles have different shapes and the interaction
potential is very sophisticated. Nevertheless, the “backbone” of all these models is
the pure MD formulation, and their parallel realization is based on MD parallel
algorithms and methods.


Fig.6.  Two  balls  made  of  particles  hitting  one  another.  MD  3-D  simulation. 

We can expect, of course, that by making the model more exact (e.g. by applying more
realistic potentials), and thus more complicated, one can eventually obtain
MD simulation results which are in good quantitative agreement with
experiment. However, the fact that even for the simplest implementation of the MD
method the quality of the results obtained is astonishing emphasizes the universal character
of the MD approach. For example, some effects in granular dynamics similar to those
observed in reality can also be simulated using the simplest “soft balls” MD
algorithms (see Fig.6). This fact can be exploited for animation purposes.

4  Method  of  particles  as  a  predictive  display 

In some situations the detailed physics behind the phenomena under
consideration is not crucial. In animation methods which assume some level of
agreement with physical laws (so-called predictive display), visual impression is
more important than accurate quantitative agreement with reality.

Assume that we are going to animate a thin flexible surface. This is in fact a very
complicated task. As was shown in [18], such animation in real time is
impossible due to the complicated mathematical models underlying fabric dynamics.
Moreover, the simulation needs supercomputer power when a typical FEM algorithm
is involved.

Imagine that the fabric is made of particles. At the beginning of the simulation the
particles are placed at the nodes of a hexagonal or rectangular grid (see Fig.7).

Each particle interacts with its neighbors via a semi-harmonic potential (for more
details see [19]). Let us introduce gravitation and friction forces into Eqs.(1). Applying
the leap-frog numerical scheme to the Newton equations (1) we obtain:


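$$ \mathbf{v}_i^{n+1/2} = \frac{1-\varphi}{1+\varphi}\,\mathbf{v}_i^{n-1/2} + \frac{\Delta t}{1+\varphi}\left[\mathbf{g} - \alpha \sum_j \left(r_{ij} - a_{ij}\right)\frac{\mathbf{r}_i^{n} - \mathbf{r}_j^{n}}{r_{ij}}\right], \qquad \mathbf{r}_i^{n+1} = \mathbf{r}_i^{n} + \mathbf{v}_i^{n+1/2}\,\Delta t \qquad (2) $$

assuming that the friction force is $\mathbf{F}_i = -\lambda\,\mathbf{v}_i$, and

$$ \alpha = \frac{k}{m}, \qquad \varphi = \frac{\lambda}{2m}\,\Delta t, $$

where:

$r_{ij}$ - current distance between particles i and j,

$a_{ij}$ - initial distance between i and its neighbours on the mesh at the beginning of
simulation,

$\mathbf{g}$ - gravitational acceleration,

m - particle mass,

k - a parameter of the semi-harmonic interparticle potential assumed,

$\Delta t$ - time step.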
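A leap-frog update with a friction force $-\lambda\mathbf{v}$ can be sketched as follows. The example is deliberately reduced to a single mass hanging on one spring in one dimension, and it uses an ordinary harmonic spring in place of the semi-harmonic potential of [19]; the function names and all numerical values are assumptions chosen for the illustration.

```python
# Sketch only: leap-frog step with friction F = -lam * v, where
# phi = lam * dt / (2 m) appears because the friction term is evaluated
# at the mean of the old and new half-step velocities.

def damped_leapfrog_step(y, v_half, accel, dt, lam, m):
    """Advance position y and half-step velocity v_half by one time step."""
    phi = lam * dt / (2.0 * m)
    v_new = ((1.0 - phi) * v_half + accel(y) * dt) / (1.0 + phi)
    return y + v_new * dt, v_new


def hang_mass(steps=5000, dt=0.01, m=1.0, k=100.0, lam=2.0, g=9.81, a0=1.0):
    """Mass on a vertical spring of rest length a0 (y measured downwards)."""
    alpha = k / m

    def accel(y):
        return -alpha * (y - a0) + g      # spring plus gravity, per unit mass

    y, v = a0, 0.0                        # start at rest, spring unstretched
    for _ in range(steps):
        y, v = damped_leapfrog_step(y, v, accel, dt, lam, m)
    return y, v


final_y, final_v = hang_mass()
```

Because friction removes kinetic energy at every step, the mass settles at the static equilibrium `a0 + g * m / k`; for a fabric, the same update would be applied to every particle with the spring term summed over its mesh neighbours.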

Using MD code modified in such a way, realistic pictures of the fabric dynamics
can be obtained during on-line animation on a standard Pentium II based PC (see,
for example, Fig.8; see also [17,19]).

Next, assume that several moving objects are animated. For very simple objects
(see Fig.9) this can be done easily using the MD code on a PC. However,
when the objects are more complicated and each consists of about 10,000 particles,
on-line animation is possible using a parallel machine.

As shown in [20], objects-to-processor mapping can be used. Placing more than one
object on a single processor is recommended. Additionally, two processors are used for
graphical service and animation supervision (master processor), respectively. Load
balancing is organized in such a way that two colliding objects are moved to a single
processor. If the number of objects taking part in a collision is larger than 2, the number
of processors used for simulation of this event is increased. The processors which are
used to simulate the dynamics of the remaining objects communicate only with the
master processor to check collision conditions. As shown in Fig.10, for four colliding
objects the optimal number of slaves is 2 (plus master and visualization processors).

Fig.9. Fragments of trajectories of the simple objects animated using the MD approach. The scene
consists of: 2 sticks (A), 2 circles of various radii (B), a square (C) and a triangle (D).
One can see the collisions between the objects and the square rotating after collision
against the wall.

Fig.10. Timings for animation of a scene (with and without load balancing) which consists
of four moving cubes (2000 particles each). SEQ - sequential version, M - master processor, V
- processor for visualization, K - slaves.

5  MD  in  global  optimization  problems 


Changing the particle abstraction level and the interpretation of interparticle forces
makes it possible to apply MD code to problems of vehicle navigation
between obstacles and to the search for the global minimum of multidimensional functions.

In the first case the shortest or the most feasible path of a moving vehicle from a
starting point to a target is sought in the presence of both static and dynamic obstacles.
The application of the MD model to this problem is straightforward. Let us
assume that the vehicle, represented by a particle, is attracted by the target. The
obstacles are made of static particles, which repel the moving object. Then the object
moves in accordance with Newton's laws.


An MD approach to the navigation problem [21] differs from the classical
navigation algorithms. The difference concerns the dynamic layer of the problem
considered, i.e. the movement scenario, which is directly connected by physical laws
with the vehicle-environment (obstacles and terrain) interactions. This makes the
algorithm more flexible and open to verification and improvement. Unlike in graph
theory algorithms, both static and moving obstacles can be considered. Examples of
vehicle paths are shown in Fig.11, assuming the presence of static obstacles only.
Even for a more complicated scenario the parallel realization of the MD algorithm is not
necessary, because only local interactions between the object and the obstacles are
considered. When moving obstacles are taken into account, the parallel algorithm can
be similar to that described earlier for animation purposes.
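The navigation idea (a vehicle particle attracted by the target and repelled by static obstacle particles, moving under Newton's laws with friction) can be sketched as below. The concrete force laws are assumptions made for the example and are not taken from [21]: a harmonic attraction towards the target and a short-range repulsion that vanishes beyond the distance `r_rep`. The function name and parameter values are likewise hypothetical.

```python
import math

# Sketch only: one vehicle particle steered by assumed attraction/repulsion
# forces and integrated with a damped leap-frog step.

def navigate(start, target, obstacles, steps=3000, dt=0.01,
             k_att=1.0, k_rep=1.0, r_rep=2.0, lam=2.0, m=1.0):
    """Return the list of positions visited by the vehicle."""
    pos = [float(c) for c in start]
    vel = [0.0] * len(pos)
    phi = lam * dt / (2.0 * m)
    path = [tuple(pos)]
    for _ in range(steps):
        # Attraction: pulls the vehicle towards the target.
        force = [k_att * (t - p) for t, p in zip(target, pos)]
        # Repulsion: each obstacle pushes the vehicle away at short range.
        for obs in obstacles:
            d_vec = [p - o for p, o in zip(pos, obs)]
            d = math.sqrt(sum(c * c for c in d_vec))
            if 0.0 < d < r_rep:
                mag = k_rep * (1.0 / d - 1.0 / r_rep) / (d * d)
                force = [f + mag * c / d for f, c in zip(force, d_vec)]
        vel = [((1.0 - phi) * v + f / m * dt) / (1.0 + phi)
               for v, f in zip(vel, force)]
        pos = [p + v * dt for p, v in zip(pos, vel)]
        path.append(tuple(pos))
    return path


# One obstacle placed just beside the straight line from start to target.
path = navigate((0.0, 0.0), (10.0, 0.0), [(5.0, 1.0)])
end_gap = math.dist(path[-1], (10.0, 0.0))
closest = min(math.dist(p, (5.0, 1.0)) for p in path)
```

With these settings the path bends away from the obstacle and then settles on the target. Force-based navigation of this kind can stall when attraction and repulsion cancel, so the obstacle layout in the example was chosen to avoid that situation.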

The problem of global optimization of a multimodal function in a multidimensional
space is one of the most important and complex goals in many branches of science
and engineering. Because, in general, the problem cannot be solved using deterministic
approaches, many stochastic and heuristic methods have been constructed in search of an
“immune” (problem independent) optimizer. To the best of our knowledge such a
method does not exist, though the success of approaches such as genetic algorithms and
simulated annealing is beyond question. MD, like both of these heuristics, is based on
principles which come from nature. Let us assume that a small
dissipative factor is introduced in Eqs.(1). After some time, when the kinetic energy of
the particle system has been removed, the particles stop moving and a minimum of the
total potential energy of the system is attained. When the dissipation of the kinetic
energy is sufficiently slow, the global minimum is achieved.


Fig.11. The paths from starting point to the target for different initial velocities of a vehicle.
The most feasible path is the shortest one.

In Fig.12 one can see a realization of this idea. A global minimum of a multimodal
and multidimensional function f(x) is sought. Initially the particles are scattered
randomly in the function domain. The particles, whose coordinates are $x_i$ (i = 1, ..., M),
interact via two-body, one-directional forces: only the particle representing the lower f(x)
value attracts the other one. The particle which gives the lowest function value at the
current simulation step is stopped. The force between two particles i and j
depends on the difference between the function values at $x_i$ and $x_j$, i.e., $|f(x_i) - f(x_j)|$.
As one can see in Fig.12, the right solution is found with a relatively small number of
particles and without calculating the gradient of f(x).
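The rules of this particle search can be put into a short sketch. Several details are not given in the text and are assumed here: the attraction acts along the line joining the two particles with magnitude $|f(x_i) - f(x_j)|$, a friction term is added, unit masses are used, and positions are clipped to the search box. The names `md_minimize` and `rastrigin` and all parameter values are hypothetical.

```python
import math
import random

# Sketch only: gradient-free search in which better particles attract
# worse ones and the current best particle is held still.

def rastrigin(x):
    """A standard multimodal test function with its minimum 0 at the origin."""
    return sum(c * c - 10.0 * math.cos(2.0 * math.pi * c) + 10.0 for c in x)


def md_minimize(f, dim, n_particles=20, steps=500, dt=0.01, lam=1.0,
                lo=-5.0, hi=5.0, seed=0):
    rng = random.Random(seed)
    xs = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vs = [[0.0] * dim for _ in range(n_particles)]
    phi = lam * dt / 2.0                      # unit mass assumed
    history = []                              # best value at every step
    for _ in range(steps):
        vals = [f(x) for x in xs]
        best = min(range(n_particles), key=vals.__getitem__)
        history.append(vals[best])
        for i in range(n_particles):
            if i == best:
                vs[i] = [0.0] * dim           # the best particle is stopped
                continue
            force = [0.0] * dim
            for j in range(n_particles):
                if vals[j] < vals[i]:         # one-directional attraction
                    d = [b - a for a, b in zip(xs[i], xs[j])]
                    dist = math.sqrt(sum(c * c for c in d))
                    if dist > 0.0:
                        w = (vals[i] - vals[j]) / dist
                        force = [fk + w * c for fk, c in zip(force, d)]
            vs[i] = [((1.0 - phi) * v + fk * dt) / (1.0 + phi)
                     for v, fk in zip(vs[i], force)]
        # Move all particles, keeping them inside the function domain.
        xs = [[min(hi, max(lo, x + v * dt)) for x, v in zip(xi, vi)]
              for xi, vi in zip(xs, vs)]
    vals = [f(x) for x in xs]
    best = min(range(n_particles), key=vals.__getitem__)
    history.append(vals[best])
    return xs[best], vals[best], history


best_x, best_f, history = md_minimize(rastrigin, dim=2)
```

Since the best particle never moves, the best value found cannot get worse from one step to the next; whether the global minimum is actually reached depends on the function, the number of particles and how slowly the kinetic energy is dissipated.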

The MD approach to global optimization has been successfully applied to so-called visual
clustering and non-linear mapping problems [22]. The principal goal of non-linear
mapping algorithms is to generate points in 2(3)-dimensional space such
that the distances between them approximate the distances between the respective
N-dimensional points, which represent the measurement data. The method makes it
possible to visualize multidimensional forms in 2(3)-dimensional space. This is
accomplished by minimizing the criterion function

$$ E = \sum_{i}\sum_{j} \cdots \qquad (3) $$

where $D_{ij}$ is the squared distance between points i and j in N-dimensional space, $r_{ij}$
is the squared distance between the respective points i and j in 2(3)-D Euclidean space,
and w and m are parameters ($m > 1$ and $w \in \{-1, 0, 1\}$). The criterion (3) is a
generalization of the well known Sammon's criterion.


Fig.12. The application of the MD paradigm in the search for the global minimum of a
multimodal and multidimensional (10-D) test function.

Fig.13. Snapshots of the MD mapping process of 100-dimensional data placed on a sphere.


A new method proposed in [2,22] uses MD for minimization of the criteria (3,4). It is
assumed that M particles are scattered randomly in 2(3)-D space. Each particle corresponds
to the respective N-dimensional data point. The particles interact with one another via a
two-body potential dependent on $D_{ij}$ and $r_{ij}$ and equal to $V_{ij}(D_{ij}, r_{ij})$. The particles
move according to Newton's laws of motion. The assumed friction force removes the
kinetic energy from the particle system, which eventually stops moving when the
potential energy (1) reaches the global minimum. The positions of the particles reflect the
final result of mapping (see Fig.11).
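A minimal sketch of such a mapping is given below. It does not reproduce the criterion of [2,22]: as a stated assumption it uses the simplest unweighted pair potential $V_{ij} = (D_{ij} - r_{ij})^2$ with squared distances, unit masses and a damped leap-frog update. The names `stress` and `md_map` and all parameter values are hypothetical.

```python
import random

# Sketch only: particles in 2-D move under forces derived from the assumed
# potential E = sum_{i<j} (D_ij - r_ij)^2 until friction brings them to rest.

def stress(D, pts):
    """Value of the assumed criterion for a given low-dimensional layout."""
    total = 0.0
    for i in range(len(pts)):
        for j in range(i + 1, len(pts)):
            r = sum((a - b) ** 2 for a, b in zip(pts[i], pts[j]))
            total += (D[i][j] - r) ** 2
    return total


def md_map(data, dim=2, steps=4000, dt=0.01, lam=2.0, seed=0):
    """Return (initial layout, final layout, squared-distance matrix D)."""
    n = len(data)
    D = [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in data] for p in data]
    rng = random.Random(seed)
    pts = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(n)]
    start = [list(p) for p in pts]
    vel = [[0.0] * dim for _ in range(n)]
    phi = lam * dt / 2.0                      # unit mass assumed
    for _ in range(steps):
        forces = []
        for i in range(n):
            f = [0.0] * dim
            for j in range(n):
                if i == j:
                    continue
                diff = [a - b for a, b in zip(pts[i], pts[j])]
                r = sum(c * c for c in diff)
                # -dE/dx_i: pushes apart if too close, pulls in if too far.
                f = [fk + 4.0 * (D[i][j] - r) * c for fk, c in zip(f, diff)]
            forces.append(f)
        vel = [[((1.0 - phi) * v + fk * dt) / (1.0 + phi)
                for v, fk in zip(vi, fi)] for vi, fi in zip(vel, forces)]
        pts = [[p + v * dt for p, v in zip(pi, vi)]
               for pi, vi in zip(pts, vel)]
    return start, pts, D


# Corners of a unit square embedded in 3-D; an exact 2-D layout exists.
square = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
start, final, D = md_map(square)
```

Comparing `stress(D, start)` with `stress(D, final)` shows how much of the mismatch the damped dynamics has removed; the layout reached is a minimum of the assumed criterion, not necessarily the global one.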

6  Conclusions

In the paper it is shown that the MD model can be treated as a natural solver which
has a broad scope of use in different fields. The application of MD simulation at
mesoscopic scales for studies of the collective movement of particles can be a valuable
supplement to classical, continuous models. For studies of nonlinear phenomena such
as the Rayleigh-Taylor instability, which have their origins in the microscale, MD can be
treated as a unique tool for simulating the initial phase of mixing and observing the
evolution of instabilities. Moreover, MD algorithms yield a simple and effective parallel
computational code, which can be treated as a “backbone” for other, more
sophisticated particle-based methods such as dissipative particle dynamics and
smoothed particle hydrodynamics, used in simulations of macroscopic world
phenomena. Changing the definition of a particle from a single atom to a cloud of
matter, and changing the interaction potentials assumed, does not affect the structure
of the parallel codes used for the pure MD formulation. The MD model can also be
applied to the animation of macroscopic objects, giving an impression that the
dynamics of the objects is in good agreement with physical laws, though the detailed
physics may be considerably simplified.

The encouraging results of tests of MD applications in global optimization
problems, such as the vehicle navigation problem and the search for the global minimum of
multimodal and multidimensional functions, show that miscellaneous branches of
science are governed by similar, general and universal rules, while
computer science plays an important role in their extraction and dissemination.

Abstract. The goal of this paper is to propose cost-performance criteria which
can be used to take co-design decisions. The criteria are simplified with some
assumptions, and are used to modify the hardware design of a fine grain
multiprocessor architecture. The modifications optimize the execution time of
the elemental operations (addition, subtraction, comparison and product). The
criteria are a trade-off measure between the hardware complexity and the
execution time of the elemental operations. The modifications improve the
system efficiency while the cost is maintained.


1  Introduction. 

When some modifications should be made in a hardware design, and the cost of the
system is also important, one main question is: does the performance increase justify
the cost increase? Parallel architectures, however, allow a trade between the
processor element complexity and the number of processor elements of the system
while the total cost of the system is maintained. This means that, for the same total
cost, we can have more complex processor elements but fewer of them, or
less complex processor elements but more of them. There will therefore
exist a trade-off between the processor element complexity
(unitary cost) and the system size that maximizes the system performance for a
given cost. So, the new question is: does the hardware modification increase the system
performance while maintaining the total cost? If the answer is yes, the
modification can be accepted immediately; otherwise the modification will be
accepted or not depending on the cost goal.

This paper proposes cost-performance criteria for deciding whether a
modification can be accepted immediately or not. The criteria are used to evaluate
hardware modifications which try to decrease the execution time of the software
instructions for elemental operations.

But what was the problem that led us to this point? Some time ago, we
designed a vision oriented SIMD architecture [1], but the saturation effect that
SIMD architectures show is well known: in most cases, the slope of the performance
function decreases as the number of processor elements increases for
intermediate and high level vision algorithms. We have demonstrated in previous
works [1] that the reconfiguration of the datapath width palliates this problem.

The reconfiguration consists in trading the number of processor
elements of the system against their datapath width. So, we can have a system made up
of n processor elements with 1-bit datapath width and we can reconfigure it to a
system made up of n/B processor elements with B-bit datapath width. The problem
arose when we evaluated the speed of the hardware for elemental operations in
reconfigurated mode. This speed was low, and hardware modifications were
necessary for a high performance in reconfigurated mode.

Then, in order to have objective parameters to measure the convenience of a
hardware modification, we proposed the cost-performance criteria which are
explained in this paper.

Other works on this theme have been developed in the literature. References
[2], [3] give general ideas about hardware-software co-design. However, only
general criteria are shown in [4] and [5]. In [4], optimization criteria are presented
which can be applied to architectures that show a linear cost in their communication
network (i.e. a processor element can always communicate with the same processor
elements for all system sizes). In [5] the criteria take into account a non-linear
dependence of the cost on the interconnection network and can be applied to more
complex connection patterns.


2 Cost-performance criteria.

The total cost of a system may be very difficult to model: hardware, software and
peripheral circuitry, among others, are different parts of the cost. In order to obtain
reliable models, [4] and [5] take into account the hardware cost due to the silicon
area, which is the most important in most cases.

We have used the criteria described in [4] because in our SIMD architecture
every processor element can communicate with the same neighbours (North, East,
South, West) regardless of the system size. Reference [4] derives the
condition which a modification has to satisfy:


[Equation (1), whose left-hand side is $A_i/A_f$, is not recoverable from the source.]

Where:

$A_{i/f}$ = Initial/final area, before/after the modification.

$N_{i/f}$ = Initial/final number of processor elements.

$E(N_{i/f}, A_{i/f})$ = Initial/final system efficiency.

$T_{poop}(A_{i/f})$ = Time per optimized operation in the initial/final conditions.

$T_{noop/oop}$ = Time which is needed by a processor element to execute the non
optimized/optimized operations of the task.

If the modification implies a higher area for the processor element, then
normally $E(N_i, A_i)/E(N_f, A_f) > 1$ and a harder condition, which is easier to
verify, is:

$$\frac{A_i}{A_f} > 1 + t_{oop}(A_i) \times (R - 1) \qquad (4)$$

The simplified procedure to evaluate the convenience of a modification is the
following (we suppose that initial conditions are known):

a) Calculate the final area $A_f$.

b) Obtain the final time per optimized operation $T_{poop}(A_f)$.

c) Get the reduction factor R.

d) Find the time relation between the optimized operation and the total task in
the initial conditions, $t_{oop}(A_i)$.

e) Check eq. (4). If it is verified and the modification has increased the
processor element area, then the modification can be accepted; otherwise it is
necessary to evaluate the final efficiency $E(N_f, A_f)$ and to check eq. (1).
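The check in step e) can be sketched in a few lines of Python. This is only an illustration of eq. (4) as printed above; the function and parameter names are hypothetical, `area_ratio` stands for A_i/A_f, `reduction_factor` for R and `t_oop` for the fraction of the task time spent in the optimized operation.

```python
def accept_by_eq4(area_ratio: float, reduction_factor: float, t_oop: float) -> bool:
    """Return True when eq. (4) holds: A_i/A_f > 1 + t_oop(A_i) * (R - 1)."""
    return area_ratio > 1.0 + t_oop * (reduction_factor - 1.0)


def min_t_oop(area_ratio: float, reduction_factor: float) -> float:
    """Smallest t_oop(A_i) for which eq. (4) can hold (boundary value).

    Obtained by solving eq. (4) for t_oop; only meaningful when the
    modification speeds the operation up, i.e. 0 < R < 1.
    """
    if not 0.0 < reduction_factor < 1.0:
        raise ValueError("reduction factor R must lie strictly between 0 and 1")
    return (1.0 - area_ratio) / (1.0 - reduction_factor)


# Hypothetical usage: a 2% larger processor element and a halved operation time.
threshold = min_t_oop(1.0 / 1.02, 0.5)
print(f"t_oop must exceed {threshold:.1%}")
```

If eq. (4) fails, the procedure above still requires the full check of eq. (1), which this sketch does not attempt.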


3  Criteria  application  to  the  addition  operation. 

Figure 1 shows an addition example: data 1 is added to data 2 and the result is
obtained. This type of addition (reconfigurated mode) presents two main problems:

a) The carry generated by the most significant processor element should be
communicated to the least significant processor element. Besides, the
communication path depends on the number of processor element rows that
make up a multibit processor (see fig. 2). For an even number of rows, a
horizontal communication followed by a vertical one is necessary, while for an
odd number of rows only one vertical communication is necessary.

b) The least significant processor element receives zero in its ALU carry input
for the first sum and, for long data (more than one word), it receives the carry
from the most significant processor.

Figure 1. Multibit addition example.

These and other considerations make the multibit addition inefficient. For
100% efficiency these two terms should be equal:

a) Number of clock cycles to execute one monobit addition.

b) Number of clock cycles to execute B multibit additions. Remember that B is
the datapath width in the reconfigurated work mode.

Figure 2. Carry communication path depending on the number of processor element rows.


Actually, a multibit processor is made up of B processor elements, so a fair
comparison is to evaluate the clock cycles for the same number of operations in both
work modes (monobit and reconfigurated). This implies the previous equality:
B additions are executed in parallel in monobit mode, at a time cost of
the number of clock cycles for one monobit addition, so B additions should be
executed in multibit mode in the same time. Because of the bit parallelism, for 100%
efficiency every multibit addition should execute in 1/B times the number of clock
cycles of one monobit addition. However, due to the hardware design and the
difference between the data length and the datapath width of the architecture, the
efficiency will be lower than 100%.
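One way to read the equality above as a number is sketched below. It assumes that efficiency is the ratio between the cycles of one monobit addition and the cycles of B multibit additions; the paper does not print this formula, so the function name and the cycle counts in the usage lines are hypothetical.

```python
def multibit_efficiency(cycles_monobit: int, cycles_multibit: int, width_b: int) -> float:
    """Efficiency (%) of the reconfigurated mode relative to the monobit mode.

    Assumed definition: 100% is reached when B multibit additions take as
    many clock cycles as one monobit addition.
    """
    if cycles_monobit <= 0 or cycles_multibit <= 0 or width_b <= 0:
        raise ValueError("cycle counts and datapath width must be positive")
    return 100.0 * cycles_monobit / (width_b * cycles_multibit)


# Hypothetical cycle counts, chosen only to show the behaviour.
print(multibit_efficiency(cycles_monobit=16, cycles_multibit=2, width_b=8))
print(multibit_efficiency(cycles_monobit=16, cycles_multibit=4, width_b=8))
```

Under this reading, halving the cycles per multibit addition doubles the efficiency.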

Figure 3 shows the efficiency for the addition operation with the initial hardware
design. In order to increase its efficiency we have modified the hardware design.
The modification allows the carry communication between the most significant
processor element and the least significant processor element in a single clock cycle.


Figure 3. Efficiency (%) for the multibit addition with respect to the monobit addition,
without hardware modification (axis label: data length).


The hardware modification adds one input to the output multiplexer and to the
ALU carry input multiplexer. Figure 4 shows the efficiency with the hardware
modification included in the design. Note that the efficiency has doubled.
This means that the execution time per multibit addition has been halved.


Figure 4. Efficiency (%) for the multibit addition with respect to the monobit addition,
with hardware modification.

The increase in the processor element area due to the modification is 2%, using the
ES2 library for 0.7 µm double metal CMOS technology.

Once we have the time relation and the area relation, we can evaluate eq. (4). In
this case, $A_i/A_f = 0.98$ and $R = 0.5$.

So, from eq. (4), $t_{oop}(A_i) \geq 3.9\%$. This means that, for the modification to be
accepted, at least 3.9% of the total execution time of the task, in the initial
conditions, should be dedicated to addition operations in reconfigurated mode.
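The 3.9% figure can be checked by solving eq. (4) for the time fraction. The short derivation below assumes that the 2% area increase means $A_f = 1.02\,A_i$, which is what gives the rounded ratio 0.98:

$$\frac{A_i}{A_f} = \frac{1}{1.02} \approx 0.9804, \qquad R = 0.5$$

$$0.9804 > 1 + t_{oop}(A_i)\,(0.5 - 1) \;\Longleftrightarrow\; t_{oop}(A_i) > \frac{1 - 0.9804}{0.5} \approx 0.039$$

With the ratio rounded to 0.98 before solving, the bound would come out as 4.0% instead.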

A global vision task is normally divided into different subtasks. Every subtask
may have one part of the object code that is executed in monobit mode and another part
executed in reconfigurated mode. Besides, not all operations are in
reconfigurated mode. So, depending on the vision task, the hardware modification
will or will not be accepted.


4  Conclusions. 

Cost-performance criteria have been proposed in this paper that can be applied to
multiprocessor architectures with no cost dependence on the interconnection network
(the number of interconnections per processor element does not depend on system
size). The criteria have been simplified to make the equations easier to evaluate and
one example has been explained.

The example demonstrates that the criteria can be extended to other hardware
modifications. The criteria measure the trade between the processor element
complexity and its unitary cost, while the total cost of the system is maintained.
This trade makes it possible to maximize the system performance. This means
that, for a given total cost, we can obtain the processor element design that
maximizes the system performance.

Abstract. We present an enhanced data availability I/O Subsystem
model for ParFiSys, a Distributed and Parallel File System. We evaluate
the application of data redundancy at the different levels of the I/O
hierarchy. A virtual distributed and redundant device, known as VRAID,
is used as the basis to achieve both I/O access parallelism and better
fault tolerance.

Keywords: Parallel, file system, data availability, redundancy.


Introduction 

ParFiSys [2] is a Distributed and Parallel File System devoted to exploiting as
much as possible the I/O Subsystem on architectures where several I/O nodes
are interconnected by a high performance network. The early design of ParFiSys
focused on improving I/O performance, and data availability problems due to a
large number of underlying devices [9] were not taken into account.

In this paper, we describe a new redundant I/O Subsystem model for ParFiSys
that should be able to offer data availability even when underlying devices fail.
We detail the algorithms used to improve performance by minimizing both the
impact of redundancy management on communications and the reconstruction
phase overhead. We evaluate the model over a massively parallel architecture
simulator that has also been developed [10, 12, 13].

1  I/O  Subsystem  Model 

The I/O Subsystem (Fig. 1) is built on the I/O hardware of a massively parallel
machine with a high performance interconnection network. The physical storage
devices are distributed over several I/O network nodes. Additionally, two kinds of
logical storage devices are defined: one server device (SERV) per I/O node, which
manages remote accesses to any other storage device of the node, and a single virtual
redundant storage device, known as VRAID, which distributes the data over
the SERV devices of the whole system.

Thanks to Professor De Miguel for his technical advice.

ParFiSys was developed at the Polytechnical University of Madrid, under the ESPRIT
project P5404 funded by the European Union.

Fig. 1. I/O Subsystem Architecture


The Raids and VRAID can be configured as level 0, 4 or 5 [4, 3]. Usually the
redundancy unit is known as the stripe-unit, and is composed of one storage unit of
each underlying device, one of which (the parity unit) contains the exclusive-OR
of all the others. It is important to note here that at any time the
contents of the parity unit must be consistent with the rest of the information stored in
the stripe, so a locking mechanism must be used to organize concurrent accesses
involving parity units. This means that we need to use locks at every access
except when reading from a fault-free device.
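The exclusive-OR relation between the parity unit and the other units of a stripe can be illustrated with a minimal Python sketch. The function names are hypothetical and the units are plain byte strings; it shows the parity relation only, not the ParFiSys implementation or its locking.

```python
from functools import reduce


def parity_unit(units):
    """Byte-wise XOR of equally sized storage units (byte strings)."""
    if len({len(u) for u in units}) != 1:
        raise ValueError("need at least one unit, all of the same size")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*units))


def rebuild_unit(surviving_units, parity):
    """Recover the single missing data unit from the survivors and the parity."""
    return parity_unit(list(surviving_units) + [parity])


# Hypothetical stripe with three data units of two bytes each.
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = parity_unit(data)
assert rebuild_unit([data[0], data[2]], parity) == data[1]
```

Because a write changes both a data unit and the parity unit, the two updates have to appear atomic to other requests, which is why the text requires locking around parity accesses.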

VRAID Distributed Lock Management. In the VRAID, the parity calculation is
done at the node that makes the I/O request, so a lock mechanism is required
to ensure the correct order between any number of parallel remote accesses.

We have chosen to locate a lock service at SERV, and to lock only the parity
units involved. Therefore, the distribution of locks follows the same mapping
as that of the parity units. This means three things: a) this distributed consensus
ensures per stripe-unit consistency, b) the lock service is no bigger a bottleneck
than the access to the parity unit itself, and c) there is also a unified
distributed consensus on the new lock server to use in case the device goes
to degraded state.

Improving Performance. Depending on its size, an I/O action could correspond
to a huge number of subactions over a (possibly sparse) set of individual storage
units of the underlying devices (e.g. Fig. 2). In order to reduce the number
of individual subactions and to optimize underlying device access, this set is
reordered by joining subactions that are logically contiguous, that is: 1) they refer to
the same underlying storage device, 2) they are of the same action type (lock,
read, xor, write or unlock) and 3) they concern a set of contiguous units.

Fig. 2. Raid level 5. Write from 4 to 12 decomposition (panels: device logical view;
Raid level 5 mapping)


The resulting ordered set of actions is processed by running the actions
for each device in parallel, but in the following order: all locks, all reads, the internal
xor calculation, all writes and finally the unlocks. This method has the following
properties: a) it ensures consistency between data and parity of each concerned
stripe-unit, b) it minimizes the final number of actions and therefore (in the case
of VRAID) the network traffic, and c) the final per device action is more compact
and could be done faster.
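The joining rule and the phase ordering described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the ParFiSys code: the `SubAction` and `Action` types, the function names and the use of a thread pool to model per-device parallelism are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import NamedTuple

# Phase order taken from the text: locks, reads, xor, writes, unlocks.
PHASES = ("lock", "read", "xor", "write", "unlock")


class SubAction(NamedTuple):
    device: int  # underlying storage device
    kind: str    # one of PHASES
    unit: int    # storage unit index on that device


class Action(NamedTuple):
    device: int
    kind: str
    first_unit: int
    count: int   # number of contiguous units covered


def coalesce(subactions):
    """Join subactions with the same device and kind over contiguous units."""
    merged = []
    # Sorting by (device, kind, unit) places joinable subactions side by side;
    # set() drops exact duplicates first.
    for sub in sorted(set(subactions)):
        last = merged[-1] if merged else None
        if (last is not None and last.device == sub.device
                and last.kind == sub.kind
                and last.first_unit + last.count == sub.unit):
            merged[-1] = last._replace(count=last.count + 1)
        else:
            merged.append(Action(sub.device, sub.kind, sub.unit, 1))
    return merged


def run_in_phases(actions, execute):
    """Run actions phase by phase; actions within one phase run in parallel."""
    with ThreadPoolExecutor() as pool:
        for phase in PHASES:
            batch = [a for a in actions if a.kind == phase]
            # list() waits for the whole phase before the next one starts.
            list(pool.map(execute, batch))


subs = [SubAction(0, "write", 4), SubAction(0, "write", 5),
        SubAction(0, "write", 7), SubAction(1, "write", 4),
        SubAction(0, "read", 5)]
actions = coalesce(subs)
run_in_phases(actions, lambda a: None)
```

In this sketch the two contiguous writes on device 0 become one action covering two units, which is the reduction in action count that property b) refers to.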


2  System  and  Workload  Characterization 

All the performance analyses in this paper have been made over a simulation of
a massively parallel machine characterized as shown in Table 1. The File System
is fed by workers distributed over the nodes in a round robin way. Each worker
executes I/O operations continuously from the selected synthetic workload (Table
2). We use enough workers to make the system perform at its limit.

We have done experiments in order to determine the system scalability and
its behavior on different combinations of redundancy levels and VRAID states
(fault-free, degraded and during the reconstruction phase).


Table 1. System Evaluation Parameters

Network   Crossbar topology with 100 MB/s links
Nodes     2 to 32 (plus one for VRAID type 4 or 5)
VRAID     Levels 0, 4 and 5. Unit of 64KB or 4KB for OLPT
RAID      Levels 0, 4 and 5. Unit of 4KB.
          With 4 disks (5 for levels 4 and 5)
Disks     "Seagate EliteS", 2627 cylinders * 21 tracks * 99 sectors
          5400 RPM and seek times 1.7 min., 11.0 avg. and 22.5 max. (ms)

Table 2. Synthetic Workload Parameters

OLPT      Online Transaction Processing [6, 11].
          80% reads of 4KB, 16% writes of 4KB,
          2% reads of 24KB, 2% writes of 24KB.
          All uniformly distributed.
SSIM      Scientific Simulation.
          50% sequential 1MB accesses to one 100MB file
          (90% reads, 10% writes)
          50% uniform 512KB accesses to ten 5MB files
          (10% reads, 90% writes)
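As an illustration of how a worker could draw operations from the OLPT mix of Table 2, the sketch below samples operation type and size with the listed percentages and a uniformly distributed offset. The function name, the 4KB alignment of offsets and the device size used in the usage line are assumptions; the simulator's actual generator is not described in the paper.

```python
import random

KB = 1024

# (operation, size in bytes, percentage) taken from the OLPT row of Table 2.
OLPT_MIX = [
    ("read", 4 * KB, 80),
    ("write", 4 * KB, 16),
    ("read", 24 * KB, 2),
    ("write", 24 * KB, 2),
]


def olpt_operations(count, device_size, seed=0):
    """Return `count` (operation, offset, size) tuples following the OLPT mix.

    Offsets are uniform over the device and aligned to 4KB (an assumption).
    """
    rng = random.Random(seed)
    weights = [weight for _, _, weight in OLPT_MIX]
    operations = []
    for kind, size, _ in rng.choices(OLPT_MIX, weights=weights, k=count):
        offset = rng.randrange(0, device_size - size + 1, 4 * KB)
        operations.append((kind, offset, size))
    return operations


# Hypothetical usage: 1000 operations over a 64MB device.
ops = olpt_operations(1000, 64 * KB * KB)
print(sum(1 for kind, _, _ in ops if kind == "read"), "reads out of", len(ops))
```

The fixed seed makes a run repeatable, which helps when comparing redundancy configurations under the same request stream.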


3  Results  Analysis 

In Fig. 3 we show comparative performance for different system sizes running
with VRAID level 5 in fault-free, degraded and recovery states.


Fig.  3.  Performance  in  Different  VRAID  States 


We observe that the performance in degraded state scales better
for OLPT than for SSIM. Although the overhead of degraded accesses grows with
the number of nodes involved, the probability that an OLPT operation does not
involve the failed node also grows. This is not true for SSIM accesses, which
affect all nodes, so for each write a previous read of the parity information is
needed.
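The scaling argument for OLPT can be made concrete with a deliberately simplified model. Assume, for illustration only, that a small read touches a single storage unit and that units are spread uniformly over $N$ nodes, one of which has failed. Then

$$P(\text{read avoids the failed node}) = 1 - \frac{1}{N},$$

which is $0.5$ for $N = 2$ and about $0.97$ for $N = 32$. Writes also touch a parity unit, so their probability is lower, but it grows with $N$ in the same way. An SSIM access that spans all nodes always includes the failed one, whatever the value of $N$.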

During the reconstruction phase one special worker recovers the failed device.
This implies an added overhead. To improve performance, recovery is done in
chunks which are put into normal service as soon as they are recovered. As Fig. 3
shows, the mean bandwidth during the recovery phase is better than in the degraded
state.



VECPAR'98  -  3rd  International  Meeting  on  Vector  and  Parallel  Processing 


Fig. 4. Comparative Performance for Different Redundancy Models (panels: bandwidth
for OLPT workload; bandwidth for SSIM workload; series: VRAID 0, VRAID 4, VRAID 5)


In  Fig.  4  we  show  a  32  node  system  with  different  models  of  redundancy  both 
for  VRAID  and  Raids. 

Our results show that the SSIM workload gives around 160MB/s peak bandwidth
whereas OLPT gives 25MB/s. SSIM is not affected very much by the redundancy
model, because large operations involving contiguous blocks on all disks are done
much more efficiently. Obviously VRAID level 0 gives the best bandwidth, but it
does not protect us from a node failure; it is given for comparison.

The OLPT workload involves very small operations (4KB and 24KB),
so the redundancy management overhead is more significant than in SSIM.
Nevertheless, the combination VRAID 5 - Raids 4 performs very similarly to
VRAID 5 - Raids 5.


4  Conclusions  and  Future  Work 


Given the observed system behavior, we can conclude that the system scales
very well, and systems of 128 nodes or more are possible. For small systems (32
nodes or so) we suggest configurations with VRAID level 5 and Raids level 0;
this allows the same recovery procedure for a node failure and for a disk failure.

The recovery time for a disk failure using the VRAID redundancy is
impractical in larger systems. Therefore, we suggest the use of level 5 redundancy
at both VRAID and Raids levels.

We are now including the effect of different caching alternatives on the above
results.

Abstract. In this paper we propose a system with an architecture capable of
parallel processing. Due to its computational power, the system is also able
to handle complex algorithms. This structure is applied to an AC motor
vector control system, formed by two control loops which run
simultaneously: a speed control loop and an identification loop for the motor
model parameters needed by the speed controller. This architecture allows
experimentation with new control algorithms in this field. Some results are
presented that show the system’s performance.


1  Introduction 


Experimenting with new control algorithms requires a system
with an architecture that allows its elements to be reprogrammed easily and separately
and that can execute complex algorithms simultaneously.
Also, many industrial controls are based on a multiprocessor architecture that uses
two or more low cost processors instead of one complex (expensive) processor. In
this paper we present an architecture based on two processors that allows
experimentation with the control techniques that we have previously studied
analytically and/or simulated, which will afterwards be implemented in a
multiprocessor configuration.


2  Proposed  system  architecture 


The implemented system architecture is shown in figure 1. The system has two
processors: a 486 (PC) and a Digital Signal Processor (32 bit floating point DSP).
The DSP is placed in a PC ISA bus slot, which acts as the physical interface. The
data exchange between the two processors is done using a Dual Port RAM
(DPRAM), which can be accessed simultaneously by both processors. The DPRAM
allows fast information exchange between the PC and DSP without disrupting the
processing of either device. If necessary, the DSP is able to interrupt the PC by
means of the IRQ3 line; the PC can also interrupt the DSP using one of its four
interrupt lines (INT3). The PC is able to control and monitor the DSP by means of
an I/O mapped interface. The communication of the system with the external world
is done by means of the following devices, which are connected to the DSP: an A/D
converter module with four 16 bit channels, with a maximum sampling speed of 50
kHz, and a digital I/O board with 32 user configurable I/O channels. This
configuration is used in many fields [1]. The PC is programmed in
C (Borland C). The DSP is programmed in either Assembler or C.
In the latter case, the time critical routines are programmed in
Assembler to control the execution time precisely.

Fig. 1. System architecture (figure label: PC/C32)


3  AC  motor  control  system 

The  block  diagram  of  the  AC  motor  adaptive  vector  control  system  that  we  have 
implemented  is  shown  in  figure  2.  This  control  system  has  two  loops  which  are 
running  simultaneously:  the  speed  control  loop,  that  actually  controls  the  motor 
speed,  and  the  parameters  identification  loop  that  tunes  the  FAM  controller 
parameters. 


3.1  Speed  control  loop 

This  loop  controls  the  AC  motor  speed.  It  has  the  following  elements: 

Speed Controller. It computes the torque setpoint (T) from the speed error (Ew).
This controller algorithm has been implemented using fuzzy logic, because of its
greater robustness to system changes (inertia, load) [2].

FAM Controller. It computes the voltage (V), in amplitude, phase and frequency,
that has to be applied to the AC motor from the torque setpoint (T). It uses the Field
Acceleration Method, which maintains the motor magnetising flux constant, thus
avoiding electromagnetic transients. To achieve this it is necessary to tune the FAM
controller parameters precisely in accordance with the AC motor [3], which is
performed by the other loop.

Inverter Controller. It generates every 100 µs the control signals for the inverter
gates from the desired voltage (amplitude, phase, frequency). It is based on a vector
modulation algorithm that takes into account the necessary inverter dead times, and
it uses an accumulated error algorithm to improve its performance (harmonic
distortion).

Inverter.  It  is  the  power  device  that  supplies  the  voltage  and  current  consumed  by 
the  AC  motor.  This  device  includes  the  logic  necessary  to  protect  it  from 
overvoltages  and  overcurrents. 

AC  motor.  It  is  the  machine  whose  speed  (and  torque)  we  control. 

Fig. 2. AC motor control system, formed by two loops (figure labels: PC, DSP)


3.2 Parameters identification loop

This loop modifies the FAM controller parameters. It is formed basically by the
Model Reference Adaptive Controller (MRAC), the block that identifies the
parameters that the FAM controller needs, by means of an algorithm
programmed using fuzzy logic. To perform this task the MRAC controller compares
the current that the AC motor consumes with that estimated by the FAM model; as
a result of the comparison, an amplitude error and a phase error are obtained, from
which the MRAC algorithm calculates the new parameter values. This algorithm has
been programmed from a study of the effect of parameter variations on the
amplitude and phase of the current consumed by the motor.

3.3 Tasks assignment

The task assignment is presented in figure 2. As can be seen, the DSP executes the
inverter controller and the MRAC controller algorithms. The DSP main task is the
MRAC controller, which is interrupted every 100 µs by the inverter controller
algorithm, whose output signals cannot be delayed. The 486 executes the speed
controller and the FAM controller algorithms, monitors the whole system and stores
system variables (speed, torque, voltage, ...). The AC motor speed and the current
consumed are acquired using the A/D acquisition board. The control signals for the
inverter gates are generated using 7 lines (6 gates, 1 enable) of the digital I/O board.
With this task assignment, the 486 relieves the DSP of part of the computing load,
allowing experimentation with more complex algorithms.


3.4  Data  exchange 

The  data  exchange  can  be  easily  made  by  means  of  the  DPRAM.  The  DSP  provides 
the  PC  with  the  motor  speed  acquired  by  the  A/D  converter  module,  and  the  new 
AC  motor  parameters  obtained  by  the  MRAC  controller.  The  PC  provides  the  DSP 
with  the  desired  voltage  (amplitude,  phase,  frequency)  that  has  to  be  applied  to  the 
motor.  As  one  processor  writes  to  the  DPRAM  without  interrupting  the  other,  this 
data  exchange  is  made  with  no  interaction  between  them. 
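The exchange described above can be pictured as two fixed regions of shared memory, each written by one processor only. The Python sketch below simulates this with a `bytearray`; the layout, offsets, field order and function names are hypothetical, and the real system exchanges data through DPRAM hardware programmed in C and Assembler.

```python
import struct

# Hypothetical layouts: three 32-bit floats per direction.
DSP_TO_PC = struct.Struct("<3f")   # motor speed, Rs estimate, Rr estimate
PC_TO_DSP = struct.Struct("<3f")   # voltage amplitude, phase, frequency

DSP_TO_PC_OFFSET = 0
PC_TO_DSP_OFFSET = DSP_TO_PC.size


def new_dpram():
    """Allocate a buffer that stands in for the dual port RAM."""
    return bytearray(DSP_TO_PC.size + PC_TO_DSP.size)


def dsp_publish(mem, speed, rs, rr):
    """DSP side: write the measured speed and the identified parameters."""
    DSP_TO_PC.pack_into(mem, DSP_TO_PC_OFFSET, speed, rs, rr)


def pc_read(mem):
    """PC side: read (speed, rs, rr) without signalling the DSP."""
    return DSP_TO_PC.unpack_from(mem, DSP_TO_PC_OFFSET)


def pc_publish(mem, amplitude, phase, frequency):
    """PC side: write the voltage that has to be applied to the motor."""
    PC_TO_DSP.pack_into(mem, PC_TO_DSP_OFFSET, amplitude, phase, frequency)


def dsp_read(mem):
    """DSP side: read (amplitude, phase, frequency)."""
    return PC_TO_DSP.unpack_from(mem, PC_TO_DSP_OFFSET)


dpram = new_dpram()
dsp_publish(dpram, 1500.0, 2.5, 1.75)
pc_publish(dpram, 220.0, 0.5, 50.0)
print(pc_read(dpram), dsp_read(dpram))
```

Giving each direction its own region means each location has a single writer, which matches the statement that one processor writes without interrupting the other. Whether a reader can observe a partially updated group of values depends on the hardware and is not addressed by this sketch.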


4  Results:  discussion  of  performance 

To demonstrate the system’s performance, we have studied the control system’s
response to a ramp, using the FAM controller with its parameters not properly tuned
(stator and rotor resistance, Rs and Rr respectively). In these experiments, the
parameter identification loop (MRAC controller) tunes the model parameters
(_Rs, _Rr) that the FAM controller uses, while the speed control loop
controls the motor speed. As we can see from the graphical results (figure 3), the
MRAC controller tunes the model parameters (_Rs, _Rr) to the real ones (Rs, Rr) in a
few seconds. It works properly even during transients in the speed control system.
Furthermore, the parameters identification loop improves the system’s performance,
because it obtains the real AC motor parameters that the FAM controller needs. As
we mentioned before, the parameters identification algorithm compares the real
current consumed by the AC motor with the one estimated using the model. In
order to measure the phase of the real current, a zero-crossing detection circuit is
used, which interrupts the system every cycle. This means that, at most, it is possible
to execute one identification cycle per period of the power supply signal. If we use a
single processor, the system will not be able to execute the parameters identification
algorithm so often while it is executing all the other control routines (which have to
be executed to avoid degrading the control system performance), and the
identification time will be longer than that of a parallel processing
system (figure 4).


Fig. 3. Speed (ω) and torque (T) control system response to a ramp during parameters
identification (Rs, Rr) with two processors


Fig. 4. Speed (ω) and torque (T) control system response to a ramp during parameters
identification (Rs, Rr) with one processor


5 Conclusions

We have presented a parallel processing architecture with two processors running
simultaneously: a 486 (PC) and a DSP. The latter is placed on the ISA bus, giving an
interface with enough immunity to conducted and radiated interference. This
architecture solves the data exchange between processors and allows
experimentation with an AC motor control system with two loops that have to be
executed simultaneously: the speed control loop and the parameter identification
loop. The main advantage of this system is that we can reprogram the algorithms
that one processor executes without changing the ones executed by the other
processor. The system is also capable of acquiring external signals (current
consumed by the AC motor, DC bus voltage) and generating digital output signals
(inverter control). The results presented show that the system formed by the two
processors is able to control the AC motor speed and, simultaneously, tune the
motor model parameters used by the FAM controller.


Abstract. This paper presents the parallelization of a nonlinear unconstrained
optimization code used in an industrial design problem. Two different
approaches are presented and the results of their comparison are shown.
Keywords: Nonlinear optimisation, Parallel Algorithms, Lens Design,
Parallel Linear Solvers.


1  Introduction 

In this paper we discuss an industrial design problem, showing the
difficulties encountered and why a parallel approach was needed. We then
describe the parallel algorithm and present the performance obtained.

Industrias de Optica S.A. is the biggest Spanish lens manufacturer. The flagship
product of the company is the progressive lens. This kind of lens is used to
compensate for presbyopia, which results from the aging of the eye. The market
share of this product is growing.

A progressive lens has three different vision zones. In the first, the user
can see distant objects; in the second (the intermediate vision zone) a progressive
change of optical power is made in order to allow the wearer to see at all distances.
The last zone is used for near vision. It is known that there is no analytical
solution that gives the best possible progressive lens, so it is mandatory to use
an optimization algorithm. [2]

In addition to these three zones, used in foveal vision, there is a fourth
zone, the lateral zone. All the effort in the optimisation process is devoted to
reducing the astigmatism in this zone, improving the overall lens performance.
The different zones can be observed in figure 1.

In  the  Progressive  Addition  Lens  design  process,  it  is  necessary  to  optimize 
the  lens  surface  in  every  performed  trial.  This  being  an  iterative  process,  it  is 
very  important  to  use  the  fastest  possible  algorithm.  This  is  the  motive  that  led 
us  to  a  parallel  approach. 

Also in Industrias de Optica S.A.

Fig. 1. Progressive Addition Lens vision zones


2  Mathematical  approach 


2.1 Lens Modeling

With this modification the calculation time for a Hessian was reduced enough
to make the Newton algorithm preferable to a Quasi-Newton approach. [6] [4]

3  Parallelization  approaches 

The targeted platform was a workstation cluster, so we chose PVM as the
message passing environment for the new application. [8]

To start the parallelization, a profile of the sequential version of the algorithm
was obtained. This profile made it clear that the biggest part of
the CPU time was spent on building the Hessian. Those routines were the first
ones to be parallelized. With the first parallel program, performance measurements
were made to study its behaviour. We used the analysis tools available
at CEPBA (European Center for Parallel Computing of Barcelona) [7], the
Dimemas and Paraver tools, to perform those tests.


3.1  Objective  Function  Parallelization 

The numerical tests revealed that a very important part of the calculation time
was spent in computing the objective function. Furthermore, the most important
part is the Hessian computation. So, the first parallel approach aimed at the
reduction of this time.

The Hessian is computed by finite differences of the gradient. In order to
improve the performance, an analytical gradient routine was implemented. It is
notable that mathematical packages like Mathematica or Maple failed to compute
this analytic derivative.

In order to obtain a finite difference Hessian approximation, it is necessary to calculate
n + 1 function gradients (n is the problem dimension). Those calculations
are independent, so they are split among the different available processors. A
master-slave approach is used. The other computations needed by the algorithm,
the line search and the solution of the linear equation system, are performed by the master.
Table 1 shows the speed-up results for different problem sizes. The tests
were performed with 2, 4, 8, 12 and 16 processors in order to study the algorithm
scalability.
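
The decomposition described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' PVM code: it assumes a forward-difference scheme, uses a toy quadratic objective, and replaces the PVM slaves with a thread pool from the Python standard library. The names `gradient`, `fd_hessian` and the step `h` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def gradient(x):
    # Analytic gradient of the toy objective f(x) = sum(x_i^2) + x_0 * x_1.
    g = [2.0 * xi for xi in x]
    g[0] += x[1]
    g[1] += x[0]
    return g


def fd_hessian(grad, x, h=1e-6, workers=4):
    n = len(x)
    # n + 1 independent gradient evaluations: one at x, one per perturbed point.
    points = [list(x)]
    for j in range(n):
        p = list(x)
        p[j] += h
        points.append(p)
    # The "master" farms out the evaluations; each worker returns one gradient.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        grads = list(pool.map(grad, points))
    g0 = grads[0]
    # Column j of the Hessian is approximated by (g(x + h e_j) - g(x)) / h.
    return [[(grads[j + 1][i] - g0[i]) / h for j in range(n)] for i in range(n)]


H = fd_hessian(gradient, [1.0, 2.0, 3.0])
```

In this sketch only the gradient evaluations run in the pool; assembling the matrix, like the line search and the linear solve in the text, stays with the master.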

Studying the code and profiles, it was clear that the algorithm bottleneck
was the linear solver. The traces obtained in our performance analysis tool
corroborate this conclusion. In order to improve the scalability, the parallelization
of the linear system solver was decided upon.
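
One way to put a number on the part of the run that stays sequential is the Karp-Flatt metric, e = (1/S - 1/p) / (1 - 1/p), which estimates a serial fraction from a measured speed-up S on p processors. The sketch below applies it to the 16-processor column of Table 1. The metric is not used in the original study and is offered here only as an illustration; the function name is hypothetical.

```python
def karp_flatt(speedup, procs):
    # Experimentally determined serial fraction for a measured speed-up.
    return (1.0 / speedup - 1.0 / procs) / (1.0 - 1.0 / procs)


# Speed-ups on 16 processors, taken from Table 1 (keyed by problem dimension).
speedup_16 = {70: 3.95, 140: 6.96, 390: 9.18, 1390: 12.39}

serial_fraction = {dim: karp_flatt(s, 16) for dim, s in speedup_16.items()}
for dim, e in sorted(serial_fraction.items()):
    print(f"dimension {dim:5d}: serial fraction ~ {e:.3f}")
```

With these figures the estimate falls from roughly 0.20 for dimension 70 to roughly 0.02 for dimension 1390.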


Table 1. Parallel Speed-Up with Function Parallelisation

Dimension   2 Proc   4 Proc   8 Proc   12 Proc   16 Proc
70          1.69     2.63     3.75     3.38      3.95
140         1.85     3.16     5.00     6.13      6.96
390         1.92     3.56     6.12     8.10      9.18
1390        1.95     3.78     6.90     9.08      12.39


3.2 Linear Solver Parallelization

In order to achieve a better scalability we parallelized the linear solver. We used
preconditioned Krylov subspace iterative methods as linear solvers (Conjugate
Gradient and GMRES(m)). The selected preconditioners are a set of different
Incomplete Factorizations. The parallelization of the linear solvers is based on a
Domain Decomposition data distribution. [1] The main bottleneck of the linear
solver is the solution of the sparse triangular linear system arising from the
preconditioner. The communication requirements of this operation depend on the
block structure of the triangular factors. In order to minimize this bottleneck
two strategies are used:


1. Control the fill-in at the block level with a different criterion than at the
element level.

2.  Perform  a  coloring  of  the  domains  which  minimizes  the  fill-in  at  the  block 
level  and  ensures  the  maximum  parallelism. 


Because the granularity of the Hessian assembly and the linear solver is quite
different, we use a different number of processes in each phase. This means that
additional communications are required to redistribute the data before and after
the linear system solution phase. For each problem size we must find the optimum
number of processes for each part in order to obtain the minimum execution time.
In this way we can improve the scalability of the whole application.

The results are shown in table 2 and table 3. The results with the smaller data
sets are not shown because, due to their size, they did not achieve any reasonable
speed-up.


Table 2. Parallel Speed-Up with Function and Linear Solver Parallelisation. Using 2
processors in the Linear Solver.

Dimension   2 Proc   4 Proc   8 Proc   12 Proc   16 Proc
390         1.84     2.91     4.08     4.63      4.20
1390        1.97     3.58     6.06     7.88      9.27


Table 3. Parallel Speed-Up with Function and Linear Solver Parallelisation. Using 4
processors in the Linear Solver.

Dimension   2 Proc   4 Proc   8 Proc   12 Proc   16 Proc
390         1.73     2.58     3.45     4.02      3.34
1390        No convergence


Surprisingly, we achieve no increase in speed by parallelising the linear solver.
Analysing the results and the code, we find two reasons for this behaviour:

- As we use an iterative method, the number of iterations needed to
solve the linear system is a key parameter. The parallelisation, which involves
a matrix reordering, increased the number of iterations. Furthermore, in
the bigger case (where we expected some performance improvement), the
reordering affected the algorithm's convergence in such a way that it
diverged.

- With the solver parallelisation, the number of communications is greatly
increased. In the Hessian parallelisation there are two communications, at
the beginning and the end of the parallel phase. With the linear solver, there
is communication in each linear solver iteration.

Summarising,  the  linear  system  involved  in  the  optimisation  algorithm  is  too 
small  and  too  badly  conditioned  to  be  solved  with  a  parallel  iterative  method. 


4  Conclusion  and  Future  Work 

The speed-ups obtained are satisfactory for the industrial process. It is not
expected that more than 12 machines will be used at the same time. In fact INDO is installing
a network of 6 DEC Alpha workstations with a Fast Ethernet switch. Taking the
previous results into account, with the targeted platform, the first parallel
approach is the most suitable for the company.

It is also interesting to remark on the problems with the parallel linear
solver. In our previous experience with linear systems from numerical simulations
we have never found such a badly conditioned problem. In order to overcome this
behaviour we are considering new reordering methods.

The future work includes an upgrade of the basic sequential algorithm, and
the changes needed by this improved approach. We also want to study the
possibilities of Quasi-Newton approaches to our problem.


Abstract. Finite Difference Time Domain (FD-TD) is a numerical technique
widely used to evaluate the electromagnetic field distribution in
geometrically complicated devices. The explicit formulation and the
intrinsic parallel structure of the FD-TD algorithm suggest the possibility of
increasing the code performance, particularly in terms of computation time
reduction, using parallel architectures. In this paper, advantages in the
design process of domestic microwave ovens via FD-TD on massively
parallel computers are described and discussed. Comparisons between the
simulation times required using different workstations and the Cray-T3D
parallel computer are finally reported.


1  Introduction 

In the design of microwave ovens, overall performance in terms of heating
uniformity of the load and energy conversion efficiency, user safety and device cost
reduction must be taken into account and optimized. The availability of a CAD tool is
fundamental for oven designers. In fact, it makes it possible not only to obtain improvements
in heating uniformity and efficiency, but also to prevent possible microwave leakage
and abnormal heating or arcing in the feeding system.

The  Finite  Difference  Time  Domain  (FD-TD)  method  is  a  numerical  technique  that 
can  be  profitably  used  to  investigate  the  electromagnetic  (e.m.)  behavior  of  a 
microwave  heating  applicator  [1]  [2].  Because  of  the  complexity  of  the  overall 
equations,  and  also  the  generally  complicated  geometry  of  the  heating  devices,  the 
determination  of  the  e.m.  field  distribution  inside  the  oven  could  require  many  days  of 
simulation  on  ordinary  Personal  Computers  or  Workstations  [3].  To  reduce  the 
mathematical  dimensions  of  the  problem,  some  approximations  can  be  taken  into 
account,  but  this  could  introduce  unacceptable  loss  of  accuracy. 

This bottleneck can be overcome using modern parallel computers. However, to
obtain the best results from this architecture, the simulation code must be converted to
parallel form and correctly optimized. The FD-TD approach [4], being based on an
explicit formulation with an intrinsic parallel structure of the solving equations, is
well suited to take full advantage of this kind of architecture.

After an introduction to the algorithm, the code parallelization
will be discussed and its performance presented, showing the computation time
reduction obtained on CINECA's 128-processor Cray-T3D system.

This program will be used as the basic kernel for a European Community HPCN
project, a demonstration action devoted to the introduction of High Performance
[...] domestic microwave ovens. The project is
named POPCORN (Production Of Parallel Computer Optimized micRowave oveNs)
and is managed by a consortium composed of De' Longhi, CINECA and D.E.I.S.

2  The  numerical  approach 

The electromagnetic field inside a metallic microwave cavity representing the oven has
been described by the time-domain Maxwell's curl equations. Differential operators
have been written in difference form following Yee's scheme [4]. The resulting
equations for all the 6 field components (electric and magnetic) have the same form
and differ only in the values of the multiplication coefficients, which are evaluated
according to the dielectric properties of the materials in each cell of the computational
domain. As an example, the equation of the E_x field component can be written as:

$E_x^{n+1}(i,j,k) = C_1(i,j,k)\,E_x^{n}(i,j,k) + \dots$

where n is the iteration time step, (i, j, k) represents the generic node of the discrete
computational domain and the coefficients C_1, C_2 and C_3 are functions of both the
local values of the dielectric properties and the spatial step increments along y and z.
These equations are well suited to be solved on a parallel computer. In fact, as is
easy to observe, the three electric field components do not depend on each other, but
are only functions of their own previous value in the same cell and of the
magnetic field components in the surrounding cells. A similar result holds for all
the three magnetic field equations.
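
The data dependence described above can be illustrated with a small sketch of one electric-field update. It is not the authors' Fortran code: it assumes the usual textbook form of the Yee update for E_x, in which the new value is the old value scaled by one coefficient plus coefficient-weighted differences of H_z along y and of H_y along z. The names `make_field`, `update_ex`, `c1`, `c2`, `c3` and the choice to leave the j = 0 and k = 0 faces at zero are all assumptions made for illustration.

```python
def make_field(nx, ny, nz, value=0.0):
    # Dense 3D array stored as nested lists, indexed as field[i][j][k].
    return [[[value for _ in range(nz)] for _ in range(ny)] for _ in range(nx)]


def update_ex(ex, hy, hz, c1, c2, c3):
    nx, ny, nz = len(ex), len(ex[0]), len(ex[0][0])
    new_ex = make_field(nx, ny, nz)
    for i in range(nx):
        # Cells on the j = 0 and k = 0 faces are left at zero in this sketch.
        for j in range(1, ny):
            for k in range(1, nz):
                # Differences of the magnetic field in the neighbouring cells.
                dhz_dy = hz[i][j][k] - hz[i][j - 1][k]
                dhy_dz = hy[i][j][k] - hy[i][j][k - 1]
                # Each cell needs only its own old value and nearby H values,
                # so all cells can be updated independently of one another.
                new_ex[i][j][k] = (c1[i][j][k] * ex[i][j][k]
                                   + c2[i][j][k] * dhz_dy
                                   - c3[i][j][k] * dhy_dz)
    return new_ex


n = 4
ex = make_field(n, n, n)
hy = make_field(n, n, n)
hz = make_field(n, n, n)
hz[1][2][2] = 1.0  # a single non-zero magnetic field sample
c1 = make_field(n, n, n, 1.0)
c2 = make_field(n, n, n, 0.5)
c3 = make_field(n, n, n, 0.5)
ex_next = update_ex(ex, hy, hz, c1, c2, c3)
```

Because no iteration of the triple loop reads a value written by another iteration, the loop nest can be distributed over processing elements without changing the result.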


3  The  parallel  implementation 

The Cray-T3D is a massively parallel system that integrates commodity
microprocessors with a proprietary system interconnection network and high-speed
synchronization mechanisms. Each Processing Element (PE) consists of a processor,
the associated logic and a connection to the interprocessor communication network.
The processor is a DEC Alpha 21064 chip, a 64-bit RISC architecture with dual
issue and a pipelined instruction stream, that provides 150 Mflop/s peak performance.
Each PE is equipped with a direct-mapped data cache of 8 Kbyte and with a
DRAM local memory of 8 Mwords (64-bit words). The global memory subsystem is
a directly connected shared distributed memory architecture in which memory is
globally addressable but physically distributed. The interconnection network is a 3D
torus which operates asynchronously and independently from the PEs to access and
redistribute global data. The 3D torus topology ensures short connection paths and
high bisection bandwidth (300 Mbytes/s in every direction).

The original FD-TD code has been parallelized on the Cray-T3D using the CRAFT
work sharing paradigm. A preliminary version of the parallelization scheme is reported
in [5]. CRAFT is a Cray proprietary parallel programming model, similar to HPF,
that allows the use of a global address space and supports the SPMD (Single Program
Multiple Data) programming style. The same program is loaded and executed in all the
PEs, but controlled by processor number and data. CRAFT is based on directives to
the Fortran compiler that express data and work distribution among the PEs, and it is
efficient and easy to use. Unfortunately its portability is restricted to the Cray-T3D
massively parallel systems [5]. One of the tasks of the POPCORN Consortium
is to overcome this limitation. In order to accomplish this task, the FD-TD code will
also be parallelized using the MPI message passing paradigm, a more general and
portable parallel programming model than the work sharing one. In this way the
program will be ported to different parallel architectures to investigate the
performance that can be reached even on a cluster of PCs, thus making this tool
practically useful for the Research and Development division of a company.


4  Results 

In  the  structure  of  the  developed  FD-TD  simulator  three  main  sections  can  be 
identified:  pre-processing,  field  evaluation  and  data  output. 

Data  input  and  initialization  of  all  variables  are  the  activities  of  the  first  section. 
Information  related  to  the  physical  structure  of  the  computational  domain  (dimensions, 
e.m.  properties  of  the  considered  materials,  used  mesh,  etc.)  are  obtained  reading  a 
binary  file  produced  by  an  external  program  used  for  the  modeling.  Then,  once  all  the 
dielectric  properties  of  each  mesh  point  are  known,  values  of  all  the  variables  used  for 
e.m.  field  evaluation  can  be  prepared.  The  second  section  contains  the  field 
computation  procedures,  based  on  the  Yee’s  algorithm  for  the  inner  domain  and 
boundary conditions for the outer faces. Field excitation is also performed in this
section. Output binary files are used for final post-processing procedures.

As an example, the FD-TD approach has been used to simulate the behavior of a
domestic microwave oven represented by 32 x 32 x 32 cells and for a temporal
evolution of 1000 time steps. Simply adapting the existing code to the parallel
machine, we have observed that the simulation times in all the parallel regions scale
very well with the number of processors used. However, the global performance
is always limited by the unoptimised sequential I/O procedures, which showed an
almost random contribution to the overall simulation time. This
problem has been solved by modifying the I/O routines to increase the amount of data
associated with each I/O request (Fig. 1).

Fig. 1. Comparison between the computation times (log scale) required by the I/O
procedures of the parallel FD-TD simulator before and after the optimizations.

Fig. 2. Computation times before and after the optimizations of the field evaluation section
using different numbers of PEs.
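
The idea of attaching more data to each I/O request can be illustrated with a small sketch. It is not the authors' I/O code: it writes a list of field values to an in-memory stream, first one value per request and then in blocks, and counts the requests issued. The class `CountingStream`, the helper names and the block size of 256 values are hypothetical.

```python
import io
import struct


class CountingStream(io.BytesIO):
    """In-memory binary stream that counts the write requests it receives."""

    def __init__(self):
        super().__init__()
        self.requests = 0

    def write(self, data):
        self.requests += 1
        return super().write(data)


def write_per_value(stream, values):
    # One I/O request for every 8-byte value.
    for v in values:
        stream.write(struct.pack("<d", v))


def write_in_blocks(stream, values, block=256):
    # One I/O request for every block of values.
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        stream.write(struct.pack(f"<{len(chunk)}d", *chunk))


values = [float(i) for i in range(1000)]
per_value, blocked = CountingStream(), CountingStream()
write_per_value(per_value, values)
write_in_blocks(blocked, values)
```

Both streams end up holding the same bytes; only the number of requests differs, which is the quantity the optimization in the text reduces.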


Other optimizations have also been introduced to increase the computation speed on
the Cray-T3D parallel computer. This has been done by modifying the data structure of
the coefficients used in Yee's field equations, to avoid cache misses.
With this solution we have doubled, in terms of Mflop/s, the
performance of each PE. For the main computational part of the code the
improvements shown in Fig. 2 have been obtained.

The speed-up of each section of the code as a function of the number of PEs used is reported in
Fig. 3. This speed-up has been evaluated as the ratio between the simulation time
required to perform a given procedure on a single PE and the time required to perform
the same part of the code in parallel. As can be seen, the parallel procedures
(Yee's coefficient preparation (PrepcY) and field computations (Calc)) scale
with the number of PEs used, confirming the good implementation of the
code. A loss of efficiency occurs for operations defined over data subspaces (for
example, those related to the preparation of coefficients for the boundary conditions,
indicated as PrepcB, and those related to field excitation and boundary field evaluation,
which influence the behavior of the Calc procedures). The resulting performance,
however, can be considered satisfactory.


Fig. 3. Speed-up of the different procedures of the FD-TD simulator vs the number of PEs used.


Fig. 4. Timing ratio of the 128-PE Cray-T3D system with respect to the same system with
a different number of PEs and to some Sun workstations.

Using this code, the behavior of a more complicated domestic microwave oven with
different load situations has been simulated [6]. For a 7500 time step run of a
64 x 64 x 64 mesh on 128 PEs, the overall CPU time has been reduced to 215 s,
compared with the approximately 9 h and 3.5 h required respectively by a Sun SPARCstation 20
and a Sun ULTRA 1 workstation. The speed-ups obtained are reported in Fig. 4.
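
As a rough check, the timing ratios can be recomputed from the rounded run times quoted in this paragraph. The sketch below is plain arithmetic on those figures; since the workstation times are given only approximately, the ratios need not match exactly the speed-ups quoted in the conclusions.

```python
cray_seconds = 215.0  # 128 PEs, 64 x 64 x 64 mesh, 7500 time steps

# Approximate workstation run times in hours, as quoted in the text.
workstation_hours = {"Sun SPARCstation 20": 9.0, "Sun ULTRA 1": 3.5}

ratios = {
    name: hours * 3600.0 / cray_seconds
    for name, hours in workstation_hours.items()
}
for name, ratio in ratios.items():
    print(f"{name}: about {ratio:.0f} times slower than the Cray-T3D run")
```

The rounded inputs give ratios of about 151 and 59, close to the reported speed-ups of 154 and 59.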


5  Conclusions 

In this paper, advantages in the design process of domestic microwave ovens using
massively parallel computers have been described and discussed. Comparisons
between the simulation times required by different workstations and the Cray-T3D parallel
computer have been reported to show the performance gains obtained. Speed-ups
of 59 and 154 have been shown comparing the 128-PE Cray-T3D with the Sun ULTRA 1 and
SPARCstation 20 workstations, respectively. This code will be used as the basic kernel for the
POPCORN European Community project. The FD-TD simulator will be ported to
the new CINECA Cray-T3E parallel computer and to a PC cluster using the
message passing paradigm (MPI), to investigate the level of performance that can be
reached and to make this tool available for industrial Research and Development
divisions engaged, for example, in domestic microwave oven design.


Abstract. The paper describes a general purpose tool for the debugging of
message  passing  parallel  applications.  The  basic  components  of  this  tool  are  the 
trace/replay  mechanism,  the  graphical  user  interface  and  the  central  component, 
called  visualization  engine.  The  engine,  which  plays  the  central  role  during  the 
replay  phase,  can  be  used  with  different  message  passing  environments  and 
different  graphical  interfaces.  This  is  a  significant  step  to  ensure  a  wider  range 
of  usability.  Also  relevant  is  the  fact  that  this  engine  is  able  to  learn  how  to 
detect  predicates. 


1  Introduction 

Debugging  sequential  programs  is  not  an  easy  task  and  it  is  common  knowledge  that 
the  insertion  of  print  statements  is  one  of  the  most  popular  debugging  techniques. 

Henry  Lieberman  calls  debugging  “the  dirty  little  secret  of  computer  science”  and 
concludes that it is still, largely, a matter of trial and error [10]. The fact that the April
1997 issue of “Communications of the ACM” is entirely dedicated to debugging proves
how relevant the subject is. The debugging problem has largely been ignored, which
contrasts sharply with the remarkable progress in software development over the last
thirty years [3].

Debugging  parallel  applications  is  even  more  difficult  than  debugging  sequential 
programs  due  to  non-determinism  caused  by  race  conditions.  These  conditions  happen 
since  processes  in  a  parallel  application  must  communicate  with  one  another. 

That  is  why  our  tool  focuses  on  communication  events.  The  tool  includes  a  replay 
mechanism  and  a  graphical  interface.  Between  these  two  components,  a  central 
component,  the  visualization  engine,  makes  the  tool  easily  adaptable  to  different 
message passing mechanisms and different graphical environments.


2 Comparing Similar Tools


In November 1993 a group named Parallel Tools Consortium¹ was established,
whose  “mission  is  to  take  a  leadership  role  in  defining,  developing,  and  promoting 
parallel  tools  that  meet  the  specific  requirements  of  users  who  develop  scalable 
applications  on  a  variety  of  platforms”.  According  to  this  consortium,  parallel  program 
debuggers,  execution  trace  visualizers,  and  tools  for  performance  tuning,  are 
subgroups  that  form  a  larger  group  named  Execution  Analyzers.  Besides  this,  there  are 
two more groups: Source code analyzers, which are used to analyze and convert serial
programs to parallel code, and Parallel languages and libraries.

The  usage  of  execution  analysis  tools  is  mandatory  for  programmers  to  obtain 
correct  and  tuned  parallel  programs  and  it  takes  place  after  the  usage  of  any  tool  from 
the  other  groups.  Among  those,  debuggers  have  to  be  used  before  execution  trace 
visualizers and tools for performance tuning. There are myriads of tools of these
sorts; therefore, one can only mention a limited number of them.

Among  execution  trace  visualizers  and  tools  for  performance  tuning  we  can 
mention  AIMS',  mp2sddP,  ntv',  Pablo',  VP,  Paragraph  [6],  Forge',  XProfiler', 
Paradyn’, PATOP\, Poet [7]. The following belong to the group of debugging tools:
xpdbx',  TotalView',  DETOP',  Xmdb'. 

Our tool is intended to be independent of the message passing software. However,
it is being tested with PVM applications, so it makes sense to mention execution
analyzers exclusively applicable to this message passing system: Xpvm', Hence⁴,
PVaniM [11], Xab3', DBPVM', TAPE/PVM', DDBG [4] and TOOL-SET [12]. The
last  one  comprises  a  set  of  integrated  tools,  among  them  the  debugger  DETOP  and  the 
performance  analyser  PATOP,  previously  mentioned. 

A complete description of all these tools, and a detailed comparison with the one
described here, is outside the scope of this paper. Nevertheless, it is possible to
identify two of its distinctive features. First, it incorporates both a replay mechanism
and a graphical representation, and second, its basic component, the visualization
engine, builds an object-oriented model of the message passing application. Taking
full advantage of inheritance and polymorphism, the tool becomes easily adaptable to
different message passing software and/or to different graphical representations or
graphical software.

Besides,  due  to  the  adoption  of  the  object-oriented  paradigm,  the  tool  is  flexible 
enough  to  acquire  an  important  additional  skill:  predicate  detection. 


¹ http://www.ptools.org

² Links to a site containing information about this tool can be obtained in
http://www.tc.cornell.edu/Parallel.Tools/exec-analysis-tools.html

³ Links to a site containing information about this tool can be obtained in
http://www.cse.ogi.edu/DISC/projects/mist/related-work/monitoring.html

⁴ Links to a site containing information about this tool can be obtained in
http://www.henceedp.com/


VECPAR  '98  -  3rd  International  Meeting  on  Vector  and  Parallel  Processing 


3  Our  Tool 

As  explained  before,  the  tool  includes  three  components:  a  replay  mechanism,  a 
graphical  interface  and  a  central  component  named  visualization  engine. 

The  replay  mechanism  makes  a  particular  execution  repeatable,  allowing  cyclic 
debugging,  a  frequently  used  technique  in  sequential  programs.  The  replay  mechanism 
adopted  is  similar  to  the  one  described  in  [9]  for  applications  based  on  the  shared 
memory  paradigm.  Assuming  that  the  individual  processes  in  the  parallel  application 
do not contain nondeterministic statements, this mechanism is based on the principle
that  if  each  process  is  supplied  with  the  same  input  values,  in  the  same  order,  during 
successive  executions,  it  will  exhibit  the  same  behaviour  each  time.  The  mechanism 
includes  two  distinct  phases:  trace  phase  and  replay  phase.  In  the  trace  phase,  minimal 
information  is  stored  in  order  to  minimize  the  probe  effect.  Although  minimal,  the 
stored  information  is  enough  to  assure  that,  during  the  replay  phase  each  process  will 
consume  the  same  messages,  in  the  same  order. 
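
The two phases can be sketched as follows for a single receiving process. This is a simplified illustration, not the tool's implementation: it assumes that a message is identified by its sender, that messages from the same sender keep their relative order, and that the replay routine sees all arrivals before delivering any. The function names and the message tuples are hypothetical.

```python
from collections import deque


def trace_run(arrivals):
    """Trace phase: deliver messages as they arrive, logging only sender ids."""
    log = [sender for sender, _ in arrivals]
    delivered = [payload for _, payload in arrivals]
    return log, delivered


def replay_run(arrivals, log):
    """Replay phase: hold messages back so delivery follows the traced order."""
    pending = {}
    for sender, payload in arrivals:
        pending.setdefault(sender, deque()).append(payload)
    # Consume one message per log entry, always from the traced sender.
    return [pending[sender].popleft() for sender in log]


# First execution: the receiver happens to get B's message before A's.
traced_arrivals = [("B", "b1"), ("A", "a1"), ("B", "b2")]
log, first = trace_run(traced_arrivals)

# Second execution: the network delivers in a different order.
replay_arrivals = [("A", "a1"), ("B", "b1"), ("B", "b2")]
second = replay_run(replay_arrivals, log)
```

Only the sequence of sender ids is stored, which keeps the trace small, yet the replayed process consumes the same messages in the same order.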

It  should  be  emphasized  that  it  is  not  necessary  to  modify  the  code  of  a  parallel 
application  to  use  this  debugging  tool.  The  monitoring  code  is  inserted  in  the  standard 
libraries  of  the  message  passing  software,  which  should  not  be  modified  by  the 
common  user.  In  the  trace  phase,  the  application  under  study  must  be  linked  with  one 
modified  library  (trace  library);  for  the  replay  phase,  it  must  be  linked  with  a  second 
modified  library  (replay  library). 

During  the  replay  phase,  the  visualization  engine  builds  an  object-oriented  model  of 
the  application.  The  model  provides  the  necessary  semantic  feedback  to  answer  most 
of  the  questions  the  user  may  ask  about  the  application,  during  and  after  replay. 

The  engine  contains  two  sorts  of  classes:  classes  that  define  the  building  blocks  of 
the  model  (namely,  class  Process  and  class  Message)  and  management  classes.  In  this 
last  group,  the  most  important  classes  are  class  Manager  and  class  Agent. 

There is one Agent executing in each machine that is running processes of the
replaying application. Each Agent receives information from the local processes and
sends it to the object in charge of building the object-oriented model of the
application and maintaining its coherence throughout the replay. This object is an instance
of a class derived from Manager.

In order to support different graphical representations or different graphical
environments, we take advantage of inheritance, a major property of object-oriented
models. A Graphical Interface Manager (GIManager), derived from Manager,
contains the knowledge necessary to deal with the graphical interface. Similarly, the
model contains classes GIProcess, derived from Process, GIMessage, derived from
Message, and so on.

In this way, data and code that depend on the graphical interface are encapsulated
inside GI classes. On the other hand, everything that depends on the message passing
software used by the parallel application is encapsulated inside class Agent. Agents
must be able to understand the message passing "dialect".
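The division of responsibilities described above can be sketched as follows. The class names Process, Message, Manager, Agent, GIProcess, GIMessage and GIManager come from the text; every method, attribute and the `raw_event` dictionary layout are assumptions made for illustration, and the sketch is in Python rather than the C++ of the actual engine.

```python
class Process:
    """Model building block: one process of the replayed application."""

    def __init__(self, pid):
        self.pid = pid


class Message:
    """Model building block: one message with source, destinations, tag and body."""

    def __init__(self, source, destinations, tag, body):
        self.source = source
        self.destinations = destinations
        self.tag = tag
        self.body = body


class Manager:
    """Builds the object-oriented model and keeps it coherent during replay."""

    process_class = Process   # subclasses swap in their own building blocks
    message_class = Message

    def __init__(self):
        self.processes = {}
        self.messages = []

    def process(self, pid):
        if pid not in self.processes:
            self.processes[pid] = self.process_class(pid)
        return self.processes[pid]

    def on_send(self, source, destinations, tag, body):
        self.process(source)
        for dest in destinations:
            self.process(dest)
        msg = self.message_class(source, destinations, tag, body)
        self.messages.append(msg)
        return msg


class GIProcess(Process):
    def draw(self):
        return f"process {self.pid}"


class GIMessage(Message):
    def draw(self):
        return f"message {self.source} -> {self.destinations} tag {self.tag}"


class GIManager(Manager):
    """Only the GI classes know about the graphical interface."""

    process_class = GIProcess
    message_class = GIMessage

    def render(self):
        return ([p.draw() for p in self.processes.values()]
                + [m.draw() for m in self.messages])


class Agent:
    """One per machine; the only class that knows the message passing dialect."""

    def __init__(self, manager):
        self.manager = manager

    def report_send(self, raw_event):
        # raw_event is a hypothetical dictionary standing in for a library event.
        return self.manager.on_send(raw_event["src"], raw_event["dst"],
                                    raw_event["tag"], raw_event["data"])
```

Swapping GIManager for another Manager subclass changes the representation without touching Agent, and replacing Agent changes the message passing software without touching the GI classes.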

Inheritance will be adopted again, this time to teach the model how to detect
predicates [2]. To achieve this, for each specific predicate new specific
classes, subclasses of the classes in the model, will be defined.
These classes inherit the behaviour of their superclass(es) and additionally know
how to detect that predicate. For each predicate, particular information has to be
collected and processed. Therefore, those classes must contain specific attributes and
methods. Some of these methods will be overridden methods, giving rise to
polymorphic behaviour.


Two granularity levels for the observation of a parallel application are defined:

- Level 1: external events level. External events, that is, communication events,
  are observable.
- Level 2: internal events level. Internal events, concerning each individual
  process, are observable, together with communication events.

Our tool directly supports level 1. However, it is prepared to support level 2 as
long as a sequential debugger is integrated. This kind of integration has been
accomplished in similar tools [4].

A message has a source, one or several destinations*, a tag and a body.

Level 1 bugs:

- Bugs concerning one message
  - On the source side: wrong destination; wrong tag; wrong body.
  - On the destination side: wrong source; wrong tag.
- Bugs concerning all messages: race conditions.
- Bugs concerning communication primitives: wrong type of primitive.

Each of the following examples illustrates one of the previous sorts of bugs: a
process disturbs the application's expected behaviour because it has sent a message to
the wrong destination (this is a bug concerning one message, on the source side); a
process waits for a message that will never arrive, while the correct message has
arrived and will not be consumed (this is a bug concerning one message, on the
destination side); the programmer intended to develop a race-free application but, in
fact, did not (this is a bug concerning all messages); the user intended to use a
blocking receive and instead used a non-blocking one (this is a bug concerning
communication primitives).


* A source or a destination is a process identity.

With level 1 tools, detectable predicates are those properties that depend
exclusively on variables associated with communication events. For instance, suppose
that process P1 processes an n-dimensional matrix and, after having processed each line,
sends it to process P2; the property "has process P2 received exactly n messages from
process P1?" is a detectable one.
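The example predicate can be detected by a subclass that overrides one method of a model class, as the inheritance scheme suggests. The sketch below is only illustrative: `on_receive`, `predicate_holds` and the class `CountingProcess` are hypothetical names, and `Process` is a stripped-down stand-in for the model class of the same name.

```python
class Process:
    """Stand-in for the model class: records every consumed message."""

    def __init__(self, pid):
        self.pid = pid
        self.received = []  # (source, tag) of each consumed message

    def on_receive(self, source, tag):
        self.received.append((source, tag))


class CountingProcess(Process):
    """Adds detection of 'exactly n messages received from a given source'."""

    def __init__(self, pid, watched_source, expected):
        super().__init__(pid)
        self.watched_source = watched_source
        self.expected = expected
        self.count = 0  # predicate-specific attribute

    def on_receive(self, source, tag):
        # Overridden method: keep the inherited behaviour, then update the count.
        super().on_receive(source, tag)
        if source == self.watched_source:
            self.count += 1

    def predicate_holds(self):
        return self.count == self.expected
```

Because only communication events are inspected, this kind of predicate needs nothing beyond level 1 observation.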

General predicates will be detectable as long as observation at granularity
level 2 is guaranteed.

The tool has been tested with PVM applications [1] (PVM [5] supports the
message-passing paradigm); C++ was used to develop the visualization engine and OSF-Motif
for the graphical interface.

Although our debugging tool is easily adaptable to different graphical interfaces,
we have started with a rather simple representation, the space-time diagram [8]. We
made this choice because we think that a complex representation distracts the
user, who spends more time trying to understand all the symbols than focusing
on what really matters: the parallel application.
Parallel Ensemble-Averaged Molecular Dynamics
Simulation of Shock Wave on Distributed
Memory Multicomputers
Abstract. In this paper we present a simple parallel algorithm for
ensemble-averaged molecular dynamics simulation of non-stationary
transport processes in Lennard-Jones systems on distributed memory MIMD
multicomputers. This algorithm has been used for simulation of shock
waves in two- and three-dimensional solids and calculations of
ensemble-averaged particle distribution functions of kinetic and potential energy as
well as the pair correlation functions for several cross sections within the
shock layer. The algorithm is based on parallel simulation of independent
systems from a canonical ensemble on different processors, allowing
computation of the ensemble-averaged structural and thermodynamic
properties. We have implemented the algorithm in the PVM programming
environment and performed simulations on various multicomputers.

Keywords:  parallel  computing,  molecular  dynamics,  shock  wave,  PVM. 


1  Introduction 

Molecular dynamics (MD) is a powerful simulation tool for studying
structural and dynamical properties of liquids and solids. Recently, more attention
has been focused on understanding the molecular mechanisms of nonstationary
macroscopic processes such as shock waves [1, 6], detonation [8], fracture and
failure [3], partly due to the advent of massively parallel computers.

In the present work we apply the MD method to the simulation of a planar shock
wave in a Lennard-Jones solid. The principal limitation of such a simulation is that
the shock layer properties can vary significantly within a few lattice spacings. In
the most general case, both the space and time dependences of all the dynamical
quantities need to be considered. Thus, a sufficiently large cross-sectional area is
required  to  reduce  large  nonphysical  fluctuations.  Up  to  now,  the  number  of 
atoms  per  transverse  plane  was  typically  10-  -  10^  which  is  not  sufficient  for 
reducing  the  fluctuations  considerably.  Owing  to  these  fluctuations,  important 
characteristics  of  the  shock  layer,  such  as  the  evolution  of  velocity  distribution 
function across the layer, have not been well studied. One way to improve the
quality of simulation is to take a time average, but this is possible only when
modeling steady shock waves. It has been employed in [1] through the use of a special
potential configuration, which makes it possible to generate a steady shock wave
at rest in the laboratory frame. Recent advances in parallel computers provide
a means for multi-million-atom simulations of such nonstationary processes
[2, 3], which enables one to extend the cross-sectional area considerably.
However, the implementation of message-passing multi-cell MD on massively
parallel computers [2, 9] usually involves intensive interprocessor communication on
each time step and a possibly non-uniform workload of processors. This complicates
the implementation of the spatial-decomposition technique on less sophisticated
and cheaper heterogeneous multicomputers such as network-connected clusters
of workstations or PC clones coupled with free PVM/Linux software.

* Present address: Universidade Federal de Sergipe, DEI/CCET, 49100-000,
São Cristóvão - SE, Brazil, e-mail: zybin@sergipe.ufs.br. This research was supported
by the Russian Foundation for Basic Research, grant 96-01-01901.

Here we implement an alternative ensemble-decomposition approach that
consists of taking a statistical average over a canonical ensemble by repeating
the shock wave simulations with different initial conditions. An advantage of
this approach is its straightforward implementation on parallel computers with
virtually no interprocessor communication, where each processor is responsible
for an independent simulation. It has been applied in modeling a shock wave in a
Lennard-Jones  crystal  with  10^  —  10^  atoms  in  the  cross-sectional  area  and  10" 
simulation  runs.  The  time-dependent  profiles  for  density,  velocity,  mean  square 
fluctuations  of  the  longitudinal  and  transverse  velocity  components,  internal 
energy and pressure tensor were obtained. We also measured the velocity
distribution functions, the probability density for the potential energy and the pair
correlation functions in several transverse planes within the shock layer.


2  Parallel  ensemble-averaged  MD  algorithm 


We have developed a parallel algorithm for the ensemble-averaged MD method
in the PVM programming environment for simulation of a shock wave in the
fcc lattice composed of atoms interacting via the Lennard-Jones (6-12) potential
$U(r) = 4\varepsilon[(\sigma/r)^{12} - (\sigma/r)^{6}]$. The program was initially developed in PVM
on the distributed shared-memory machine CONVEX SPP-1000 and then adapted
to the IBM SP2 RS/6000 and to network-connected PC clones. The algorithm has
the "master-slave" parallel structure presented by the following scheme:


Master

1. Initialization: compute initial data and send to K Slaves

2. The beginning of parallel computations
   for n = 1 to number of simulation steps pardo
      receive from Slaves binned profiles of variable $a_k$, k = 1, ..., K
      compute ensemble average $\langle a; f \rangle = \frac{1}{K} \sum_k a_k$ (or by Metropolis procedure)
   end pardo

3. Repeat the step 2 if required

4. Kill Slaves and finish the computations

Slaves

1. Initialization: receive from Master initial data

2. Computations in Slave k (k = 1, ..., K)
   for n = 1 to number of simulation steps do
      compute forces $f_i(r^{n-1}) = -\sum_{j \ne i} \nabla U(r_{ij}^{n-1})$, move atoms to new positions $r_i^{n}$
      compute binned profile of dynamical variable $a_k$
      send $a_k$ to Master
   end do

3. Repeat the step 2 if required
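The data flow of the scheme above can be sketched in a few lines. This is not the PVM program: worker threads stand in for the Slaves, a toy random walk replaces the MD integration, the profiles are averaged after all runs finish instead of step by step, and all function and parameter names are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def simulate_system(seed, n_steps=3, n_bins=4, n_atoms=50):
    """Stand-in for one Slave: returns one binned profile per simulation step."""
    rng = random.Random(seed)  # a different seed gives different initial conditions
    positions = [rng.random() for _ in range(n_atoms)]
    profiles = []
    for _ in range(n_steps):
        # Toy dynamics on the unit interval (placeholder for the force and move step).
        positions = [(x + rng.gauss(0.0, 0.05)) % 1.0 for x in positions]
        counts = [0] * n_bins
        for x in positions:
            counts[min(int(x * n_bins), n_bins - 1)] += 1
        profiles.append(counts)
    return profiles


def ensemble_average(n_systems, **kwargs):
    """Stand-in for the Master: average the profiles of K independent systems."""
    with ThreadPoolExecutor() as pool:
        runs = list(pool.map(lambda s: simulate_system(s, **kwargs),
                             range(n_systems)))
    n_steps, n_bins = len(runs[0]), len(runs[0][0])
    return [[sum(run[n][b] for run in runs) / n_systems for b in range(n_bins)]
            for n in range(n_steps)]
```

The only data exchanged are the short binned profiles, which is why the scheme needs so little communication compared with a spatial decomposition.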


The algorithm consists of concurrent simulations of different systems from a
canonical ensemble generated by randomization of the initial velocities of atoms.
From time to time, the binned spatial profiles of a dynamical variable $a$ are
calculated in each simulation subtask. Then the averaging over the K systems of the
ensemble is performed, yielding the expectation value $\langle a; f \rangle$ for a distribution
function $f$. The theoretical speed-up for the algorithm presented above is

Speed-up  —  i^^comp^^  )/(^comp  “b  '^comml ^ ^)) 

where Nx is the number of atoms in the cross-sectional area, and Tcomp and Tcomm
are the parameters responsible for computation and communication time. As the
computational experiments show, the communication time is negligibly small in
comparison to the computation time (see Figure 1).


Fig.  1.  The  speedup  obtained  on  different  computer  architectures:  (a)  single  hypernode 
of  the  CONVEX  SPP-1000  (4  processors),  (b)  4  IBM  RS/6000  POWER2  networked 
workstations  (for  different  numbers  Nx  of  atoms  in  cross-section). 
It should be noted that any standard MD program optimized for sequential
execution can be readily used in this algorithm with minor changes.
However, there are two cases where its use becomes impractical: when it is
necessary to use a large system size (due to the memory limitations on the
number of atoms) and when the simulation time is longer than can be realistically
achieved using a single processor (due to the large update time per atom). In such
situations the best approach is the parallel spatial-decomposition MD technique.


3  Simulation  results 


The algorithm has been used for a simulation of a shock wave in a lattice composed
of argon atoms ($m = 40$ a.u., $\sigma = 3.4$ Å, $\varepsilon/k_B = 120$ K). The rectangular
simulation cell had a length of 100-150 unit cells (200-300 planes of atoms)
in the z direction of shock propagation. The transverse x dimension was usually
50-100 unit cells, with periodic boundary conditions imposed along the x axis.
The initial density $n_0$ was chosen to be 0.93-1.03 and the temperature $T_0 = 0.1$.

A planar shock wave is initiated by causing a few atom planes to move
with a constant piston velocity $u_p$ in the z direction. During a simulation the
piston atoms are constrained to remain at their moving lattice sites. The
time-dependent profiles for velocity, density, mean square fluctuations of the
longitudinal and transverse components of atom velocity ("kinetic temperature"
components), internal energy, and pressure tensor were obtained. We also measured the
pair correlation functions, the distribution functions of the velocity components
and the probability density for the potential energy in several planes z = const
within the shock layer at different times to describe the evolution of the lattice
structure during the shock compression.

The simulation cell is divided into bins along the z direction to obtain the
shock-wave profiles. Typically the number of bins was equal to twice the number
of unit cells in the uncompressed lattice, giving a bin width of $0.87\sigma$-$0.96\sigma$. The local
properties at a point are obtained by taking a spatial average over a bin around
the point and an average over the systems of an ensemble. We have followed the
approach of [4], based on the formulas given in [7] for the expectation value $\langle a; f \rangle$
of a dynamical variable $a$ over an ensemble having distribution function $f$. It is
assumed that a local property dependent directly on atomic position, such as
the mass density, is given by


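$$\rho(\mathbf{r}) = \Big\langle \sum_i m\,\Delta(\mathbf{r}_i - \mathbf{r});\, f \Big\rangle, \qquad
\Delta(\mathbf{r}_i - \mathbf{r}) =
\begin{cases}
1/(Sd), & \text{if } z - \tfrac{1}{2}d < z_i < z + \tfrac{1}{2}d, \\
0, & \text{otherwise,}
\end{cases}$$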


where $d$ is the bin width, $S$ is the cross-sectional area of the MD cell, and $\Delta(\mathbf{r}_i - \mathbf{r})$
is the localization function (in [7] the Dirac $\delta$-function was used). For a local
property dependent on the interatomic separation $r_{ij}$, such as the stress tensor, the
interactions of atoms on the opposite sides of $S$ are taken into consideration


(T(rj)  =  - (v,-u)(v,-u)Z\(r,-r); /)+ i-V 

1  2^\  Tij  dVij 
where $\mathbf{u}(\mathbf{r})$ is the mean velocity in the bin centered about $\mathbf{r}$. All the generated
systems are accepted for taking ensemble averages, implying a constant
distribution function. The fluctuations of internal energy measured for different
systems of an ensemble were sufficiently small in the simulation. Alternatively, one can
use the canonical distribution function by introducing a Metropolis procedure
to accept or reject the new realization at a given time $t$ as in [5].
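For the mass density, the bin-based localization function amounts to a histogram in which every atom adds m/(Sd) to the bin that contains it, followed by an average over the systems of the ensemble. A minimal sketch, assuming bins that start at z = 0 and a constant distribution function (the function and argument names are hypothetical):

```python
def density_profile(z_positions, mass, area, bin_width, n_bins):
    """Mass density per bin: each atom contributes mass / (S * d) to its bin."""
    rho = [0.0] * n_bins
    for z in z_positions:
        k = int(z // bin_width)          # index of the bin containing the atom
        if 0 <= k < n_bins:              # atoms outside the binned region are ignored
            rho[k] += mass / (area * bin_width)
    return rho


def ensemble_mean(profiles):
    """Average bin by bin over the systems of the ensemble (all systems accepted)."""
    n_systems = len(profiles)
    return [sum(p[k] for p in profiles) / n_systems
            for k in range(len(profiles[0]))]
```

For example, with two atoms in the first bin and one in the second, `density_profile([0.1, 0.2, 1.5], mass=2.0, area=4.0, bin_width=1.0, n_bins=2)` gives densities in the ratio 2:1.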

Figure 2 shows some simulation results for a typical example of a shock wave in a
2D lattice with 100 atoms in the cross section. The parameters of the simulation
(piston velocity $u_p/c_0 = 0.7$, where $c_0$ is the longitudinal zero-temperature sound speed,
Mach number $M = 3$, compression $n_1/n_0 \approx 30\%$) are representative of a rather
strong steady shock wave. The simulation makes it apparent that the fluctuation
of the longitudinal velocity component, $T_n \sim \langle (v_z - u_z)^2 \rangle$, grows faster than
the fluctuation of the transverse component, $T_t \sim \langle (v_x - u_x)^2 \rangle$. A similar
phenomenon has been observed previously [1, 6]. The difference between $T_n$ and
$T_t$ leads to the anisotropy of pressure within the shock layer and to an effect
similar to surface tension [1]. The evolution of the distribution function of the
velocity component $v_z$ across the shock layer reveals significant deviation not only
from the Maxwellian equilibrium distribution but also from the corresponding
bimodal distribution. The virial terms of the normal $P_n$ and tangent $P_t$ components,
and the difference between them, are also presented, as well as the evolution of the
potential energy distribution function across several planes z = const within
the shock layer. The simulation results were obtained for 200 systems from an
ensemble, showing a considerable reduction in statistical fluctuations.

The experiments on network-connected multicomputers in the PVM environment
confirm the efficiency of the algorithm for obtaining ensemble averages of the
time-dependent dynamical variables. Three-dimensional ensemble-averaged
MD simulations of a shock wave in solids are currently in progress.

The computational resources were provided by the Keldysh Institute of
Applied Mathematics of the Russian Academy of Sciences and the National Center
of Supercomputing of the Federal University of Rio Grande do Sul. I would like
to thank S. I. Anisimov and V. V. Zhakhovskii for encouragement, support and
many useful discussions concerning this work.
Fig. 2. a) Spatial profiles of mean-square fluctuations of the longitudinal $T_n$ and
transverse $T_t$ velocity components with corresponding profiles of the density $n$. b) Spatial
profiles of normal ($P_n$) and tangent ($P_t$) components of the potential contribution to the
pressure tensor. Distance $z$ from the piston is given in $\sigma$ units. c) Distribution
functions of the longitudinal $v_z$ velocity component in different layers normal to the z-axis.
d) Distribution functions of the potential energy in different layers normal to the z-axis.
Layers are numbered from upstream to downstream. Piston velocity $u_p = 0.7 c_0$ ($c_0$
is the longitudinal zero-temperature sound speed). Shock velocity is $3.0 c_0$, compression
$n_1/n_0$ is about 30%. The data were averaged over 200 systems from an ensemble.
Abstract. The Bulk Synchronous Parallel (BSP) model was proposed by
Valiant to predict the performance of current parallel systems. In the BSP
model the computation is divided into supersteps. The fundamental assumption of
the BSP model is the h-relation hypothesis. This states that the communication
time of a given superstep is proportional to the maximum number h of packets
communicated by any processor. This paper makes a brief survey of the BSP
parallel computational model and studies the validity of the h-relation
hypothesis using current standard message passing parallel software and current
standard network technology. We measure the influence of the communication
pattern on the time invested in an h-relation. The conclusion is that a linear
model based on the h-relation hypothesis can be used to predict the execution
time for a wide set of algorithms written using standard message passing
libraries.


1  Introduction 

Among the plethora of parallel computational models proposed, PRAM, Networks,
BSP and LogP are the most popular. The PRAM model [3] has been widely used to
represent the complexity of parallel algorithms. The model is simple and useful for a
gross classification of parallel algorithms but is unrealistic because all processors
work synchronously and inter-processor communication is free. It assumes a single
shared memory where each processor can access any cell in unit time and neglects
contention caused by concurrent access to different cells within the same memory
module. In a Network Model [6], communications are only allowed between directly
connected processors; other communications are explicitly forwarded through
intermediate nodes. Many algorithms have been created which are perfectly matched
to the structure of a particular network. However, these elegant algorithms lack
robustness, as they usually do not map with equal efficiency onto interconnection
structures different from those for which they were designed.

Many current parallel computers consist of a collection of complete computers
connected through a network interface to a multistage interconnection network. Culler
et al. [2] believe that this hardware organization is going to dominate commercial
Massively Parallel Computers in the near future. The LogP model [2] characterizes a
parallel hardware/software platform by four parameters: the number of processors (P),
the gap (g), the latency (L) and the communication overhead (o). The model also
assumes that if a processor attempts to transmit more than ⌈L/g⌉ unconsumed
messages, it will stall until the message can be sent without exceeding the limit.
Although the model encourages the careful scheduling of communication and the
overlapping of communications and computations, there is a concern that a complete
LogP analysis for non-trivial algorithms is in many cases almost infeasible.

Section 2 introduces the BSP model. Section 3 measures the influence of the
communication pattern on the time invested in an h-relation. Section 4 concludes that
the linear model approach proposed in section 3 can be used to predict the
performance of PVM [4] and MPI [11] bulk synchronous programs.


2  The  Bulk  Synchronous  Parallel  Model. 


The BSP model [12] tries to provide a simple but accurate interface between the
domains of parallel architectures and algorithms. In the BSP model, a parallel
machine consists of a set of processors, each with its own private memory, and an
interconnection network that can route packets of some fixed size between processors.
The computation is divided into supersteps. In each superstep, a processor can perform
operations on local data, send packets, and receive packets. This local computation
must depend only on data present in the local memory of the processor at the
beginning of the superstep. A packet sent in one superstep is guaranteed to be
delivered to the destination processor at the beginning of the next superstep.
Consecutive supersteps are separated by a global synchronization of all processors.

The two basic BSP parameters that model a parallel machine are: the gap g, which
reflects per-processor network bandwidth, and the minimum duration of a superstep
L, which reflects the latency to send a packet through the network as well as the
overhead to perform a global synchronization. Let h be the maximum number of
packets a processor communicates (the sum of the packets received and sent) in a
superstep (such a communication pattern is called an h-relation). The BSP model
rests on the h-relation hypothesis introduced by Valiant. It states that
the communication time spent on an h-relation is given by

Communication  Time  =  g  h  (1) 

Let W denote the maximum time spent in local computation by any processor
during the superstep. The BSP model assumes that the running time of a superstep is
bounded by the formula:


Time  Superstep  =  W  +  g  h  +  L  (2) 

In consequence, the design of algorithms under the BSP model tries to minimize
the number of supersteps, the maximum number of operations performed by any
processor W and the maximum number h of packets communicated. A virtue in BSP
of having barriers available as a primitive is that analysis is simplified by assuming
the processors exit the barrier in synchrony. Special libraries have been built to
support the BSP style of programming [8]. However, such software is not yet widely
used. There is no doubt that MPI and PVM constitute the de facto current standards
for distributed computers.
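As a small worked illustration of formulas (1) and (2), the cost of a BSP program can be estimated by summing the superstep bounds. The function names and the numbers in the usage note are hypothetical; only the expression W + g h + L comes from the text.

```python
def superstep_time(w, h, g, L):
    """Formula (2): bound on one superstep, W + g*h + L."""
    return w + g * h + L


def program_time(supersteps, g, L):
    """Sum of the superstep bounds; supersteps is a list of (W, h) pairs."""
    return sum(superstep_time(w, h, g, L) for w, h in supersteps)
```

With illustrative values g = 2 and L = 3, a superstep with W = 10 and h = 4 costs 10 + 2*4 + 3 = 21 time units, and every additional superstep pays L again, which is why minimizing the number of supersteps matters.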


3  Checking the Validity of the h-relation Hypothesis

The experiments were done on the IBM Scalable POWERparallel SP2 [1]. In this
distributed-memory parallel computer, processors or nodes are interconnected through
a High Performance Switch (HPS). The HPS is a bi-directional multistage
interconnection network. The computing nodes are Thin2, each powered by a 66 MHz
Power2 RISC System/6000 processor. All the algorithms were implemented in PVMe
[5], the improved version of the message-passing software PVM.

The h-relation hypothesis does not consider the influence of communication
patterns. For example, independently of the number of pairs, processors
communicating according to a PingPong algorithm fall in the same h-relation class.
That is, h = n, where n is the size of the outgoing packets. Their cost under the BSP
model matches the cost of a single couple of communicating processors: g*n. This
h-relation class appears for the Exchange pattern for packets of size n/2. When p
processors are involved, the personalized OneToAll and AllToOne communication
patterns fall into the same class of h-relations for packets of size m = n/(p-1).
The same h-relation appears under the personalized AllToAll communication pattern
when the size of the outgoing messages is m = n/(2*(p-1)). Each processor sends
(p-1)*m packets and receives the same number, (p-1)*m. The number of communications
performed by any processor is 2*(p-1)*m = n = h. The actual times spent on these
five patterns for their respective packet sizes have to be similar if the h-relation
hypothesis holds.
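The claim that the five patterns belong to the same h-relation class can be checked by counting, for each processor, the packets it sends plus the packets it receives. The sketch below is a bookkeeping exercise only, with hypothetical function names; PingPong is modelled as a single one-way transfer of n, and n is assumed to be divisible by the packet counts involved.

```python
def h_of(sent, received):
    """h = maximum over processors of packets sent plus packets received."""
    return max(s + r for s, r in zip(sent, received))


def pingpong(n):
    # One processor of the couple sends n, its partner receives n.
    return h_of([n, 0], [0, n])


def exchange(n):
    # Both processors of the couple send and receive n/2 simultaneously.
    return h_of([n // 2, n // 2], [n // 2, n // 2])


def one_to_all(n, p):
    m = n // (p - 1)  # personalized message size
    return h_of([(p - 1) * m] + [0] * (p - 1), [0] + [m] * (p - 1))


def all_to_one(n, p):
    m = n // (p - 1)
    return h_of([0] + [m] * (p - 1), [(p - 1) * m] + [0] * (p - 1))


def all_to_all(n, p):
    m = n // (2 * (p - 1))
    return h_of([(p - 1) * m] * p, [(p - 1) * m] * p)
```

For n = 420 and p = 4, 6 or 8, every function returns h = 420, so under the hypothesis all five patterns should cost the same g*h.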

Table 1 shows the influence of the communication pattern on the time spent in an
h-relation. Experiments were carried out for each pattern with the h-relation size
between 420 and 13762560 bytes and the number of processors between 2 and 8. For
the PingPong and Exchange, 2, 4, 6 and 8 communicating couples were used. For the
others, experiments involved 4, 6 and 8 processors. For each fixed number of
processors, 500 experiments were performed. The entry in each column shows the
average time in seconds. The Exchange pattern is the fastest due to the maximum
parallelism it achieves. In the Exchange pattern the two processors in each couple
send their messages simultaneously. At the other extreme, the PingPong
communication pattern is the slowest since it implies the most sequential case of
sending (receiving) by one processor the h bytes implied in the h-relation. The time
for any other pattern is in the range between these two. A straightforward
implementation of the personalized OneToAll is to consecutively send the whole
message to each of the other processors. The policy we propose is to divide the
message into packets and apply the former algorithm to each packet. This
policy is optimal using a packet size of 32KB. The best policy for the AllToAll
pattern for h-relations under 430080 bytes is to start sending all the messages
according to a processor permutation. From this size on, the network becomes
saturated and it is better to consume the incoming messages. Although the values do
not appear in Table 1, for all the patterns, the dependency of times on the number of
processors was negligible (under 0.2% for h-relations larger than 215040 bytes).
Observe that the times for Exchange, OneToAll, AllToOne and AllToAll keep closer
to one another than to the PingPong time. The maximum difference percentage,
$\max_h \{ 100\,(\max_i T_i(h) - \min_j T_j(h)) / \min_k T_k(h) : i, j, k \ne \text{PingPong} \}$,
is 21%, reached for h = 13762560 bytes.


h (bytes)   PingPong   Exchange   OneToAll   AllToOne   AllToAll   AvErr (%)   MaxErr (%)
      420   0.000114   0.000114   0.000269   0.000157   0.000352     40.30      202.94
      840   0.000139   0.000121   ?          ?          0.000362     37.51      188.03
     1680   0.000191   0.000153   0.000313   0.000206   0.000380     34.19      141.69
     3360   0.000259   0.000217   0.000356   0.000265   0.000439     27.89      100.63
     6720   0.000394   0.000306   0.000442   0.000377   0.000527     17.57       62.21
    13440   0.000659   0.000470   0.000659   0.000596   0.000689      7.44       25.47
    26880   0.001168   0.000855   0.001081   0.000993   0.001088      0.46       20.75
    53760   0.002239   0.001553   0.001939   0.001822   0.001887     -3.73       26.11
   107520   0.004820   0.003105   0.003745   0.003562   0.003492     -1.79       32.48
   215040   0.009418   0.006418   0.007508   0.007080   0.006883     -0.75       29.61
   430080   0.018768   0.012684   0.015016   0.014222   0.013685     -0.36       30.27
   860160   0.037199   0.025448   0.030131   0.028558   0.026845     -0.39       29.26
  1720320   0.074260   0.050447   0.060469   0.057548   0.053980     -0.10       29.46
  3440640   0.148477   0.100577   0.121139   0.115967   0.106736     -0.09       29.62
  6881280   0.297393   0.201113   0.241496   0.232967   0.212628     -0.06       29.89
 13762560   0.593549   0.402237   0.487938   0.467001   0.422079      0.03       29.61

Table 1. Pattern Communication Times (in seconds) and Error Percentage for different
h-relation sizes.


To obtain the general linear approach to the h-relation time we have computed the
least squares fit of the average times of the five patterns. This gives L = 1.06*10^-4 and
g = 3.45*10^-8 (times in seconds, h in bytes). Compare these BSP-PVM values with those obtained using the Oxford
BSP  library;  g'  =  35*10",  L'  =  4.62*10"  for  the  same  machine  [7],  Columns  labeled 
Av.  Err.  and  Max.  Err.  respectively  show  the  average  and  maximum  errors  defined  as; 

AvErr(h)  =)00{(  X  iT^(h)/5}-(gh+L}}/(^.  T.(h)/5):  i  in  the  set  of  patterns}. 
MaxErifh)  =I00{  max  ^\T,(h)-(gh+L)\  /  minj  Tfh):  i,j  in  the  set  of  patterns}. 

Negative  numbers  in  the  Average  Error  column  correspond  to  cases  in  which  the 
model  time  is  larger  than  the  actual  time.  For  h  larger  than  26880,  the  Average  Error 
is  under  4%.  From  13440  on,  the  Maximum  Error  keeps  almost  constant  around  30%. 
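The fit and the two error measures can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the fit is an ordinary unweighted least squares line, and the constants G and L_CONST are this sketch's reading of the fitted values (3.45e-8 and 1.06e-4), checked only against the h = 420 row of Table 1.

```python
def fit_line(hs, ts):
    """Ordinary least-squares fit ts ~ g*h + L; returns (g, L)."""
    n = len(hs)
    mean_h = sum(hs) / n
    mean_t = sum(ts) / n
    sxx = sum((h - mean_h) ** 2 for h in hs)
    sxy = sum((h - mean_h) * (t - mean_t) for h, t in zip(hs, ts))
    g = sxy / sxx
    return g, mean_t - g * mean_h


def av_err(times, h, g, L):
    """Average error (%): pattern-averaged time against the model g*h + L."""
    avg = sum(times) / len(times)
    return 100.0 * (avg - (g * h + L)) / avg


def max_err(times, h, g, L):
    """Maximum error (%): worst deviation from the model over the fastest pattern."""
    model = g * h + L
    return 100.0 * max(abs(t - model) for t in times) / min(times)


# Times of the five patterns for h = 420 bytes (first row of Table 1).
ROW_420 = [0.000114, 0.000114, 0.000269, 0.000157, 0.000352]
G, L_CONST = 3.45e-8, 1.06e-4  # assumed reading of the fitted constants
```

With these constants, `av_err(ROW_420, 420, G, L_CONST)` and `max_err(ROW_420, 420, G, L_CONST)` come out within a fraction of a percentage point of the tabulated 40.30 and 202.94; the small gap is expected because g and L are quoted with three significant digits.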


4  Conclusions. 


The collective computation provided by MPI fits the Bulk Synchronous Programming
methodology. Extensions of PVM like La Laguna C [9] make PVM a suitable tool for
the expression of BSP algorithms. Based on the h-relation hypothesis, a linear model
to predict the performance of PVM/MPI bulk synchronous programs has
been presented. The maximum error incurred by neglecting the influence of
communication patterns is around 30% for medium and large h-relation sizes. A more
accurate prediction can be achieved by using the values of g and L obtained for each
pattern [9].

Abstract. In this paper we study theoretically two different one-sided
block Jacobi algorithms for solving the Symmetric Eigenvalue Problem.
Sequential and parallel versions of the algorithms are analyzed and compared
with a two-sided block Jacobi algorithm. The main advantage of
the one-sided algorithms is that they are better suited to parallel computers,
and when computing eigenvalues and eigenvectors on multicomputers
a shorter execution time is predicted for the one-sided algorithms
than for the two-sided algorithm.


1  Introduction 

In this work we study the design of two one-sided block Jacobi algorithms
for the Symmetric Eigenvalue Problem. The algorithms are designed using as a
basis the two one-sided Jacobi algorithms proposed in [1].

We begin by explaining how a two-sided block Jacobi method works [2];
after that, the two-sided method is compared with two one-sided block Jacobi
methods. The main goal of the comparison is to determine whether one-sided block
Jacobi algorithms can be designed that maintain the high degree of parallelism
of the algorithms not working by blocks, and whether the one-sided methods can be
competitive with two-sided block algorithms.

2  A  two-sided  block  Jacobi  algorithm 

The method works over two matrices: the matrix A and a matrix V where
the rotations are accumulated. Matrix V is initially the identity matrix. Both
matrices A and V are divided into columns and rows of square blocks of size
s x s, and these blocks are grouped to obtain bigger blocks of size 2s x 2s.

* Partially supported by Comisión Interministerial de Ciencia y Tecnología, project
TIC96-1062-C03-02; Consejería de Cultura y Educación de Murcia, Dirección General
de Universidades, project COM-18/96 MAT; and Acción Integrada Hispano-Lusa
HP1996-0007.

Jacobi methods work by constructing a matrix sequence $\{A_l\}$ by means of
$A_{l+1} = Q_l A_l Q_l^T$, $l = 1, 2, \ldots$, where $A_1 = A$. In a non-block version of the
method, $Q_l$ represents a plane rotation and each product $Q_l A_l Q_l^T$ annihilates a
pair of nondiagonal elements, $a_{ij}$ and $a_{ji}$, of matrix $A_l$; in a block version,
each $Q_l$ represents a set of rotations that nullify elements in a block of $A_l$. In each
block the algorithm works by making a sweep over the elements in the block.
The subdiagonal elements belonging to diagonal blocks will not be zeroed. To
correct this, blocks corresponding to the first Jacobi set are considered to be
of size 2s x 2s, adding to each block the two adjacent diagonal blocks and the
symmetrical block. The work over each block can be performed using level-1
BLAS. The corresponding rotations are accumulated to form a matrix Q of size
2s x 2s. Finally, the corresponding columns and rows of blocks of size 2s x 2s
of matrix A and the rows of blocks of matrix V are updated using Q. These
matrix-matrix multiplications can be performed using level-3 BLAS.

After completing a set of blocked rotations, a swap of column and row blocks
is performed, according to the ordering in use. The odd-even ordering [3] will be
used, because it simplifies a block-based implementation of the sequential
algorithm and allows parallelization. If n = 8, numbering indices from
1 to 8, and initially grouping the indices in pairs {(1,2), (3,4), (5,6), (7,8)},
the sets of pairs of indices are obtained as follows:

{(1,2), (3,4), (5,6), (7,8)}, {2, (1,4), (3,6), (5,8), 7}, {(2,4), (1,6), (3,8), (5,7)}, ...
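
The sets listed for n = 8 follow from a simple rule: pair neighbouring positions, starting at the first position on one step and at the second position on the next, and then swap the two members of every pair. The following Python sketch implements that rule; the function name `odd_even_sets` is hypothetical, and the code only illustrates the index ordering, not the block algorithm itself.

```python
def odd_even_sets(n, steps):
    """Return, for each step, the pairs of indices grouped by the odd-even ordering."""
    order = list(range(1, n + 1))          # current arrangement of the indices
    sets = []
    for step in range(steps):
        start = step % 2                   # 0: positions (0,1),(2,3),...; 1: (1,2),(3,4),...
        positions = range(start, n - 1, 2)
        sets.append([(order[i], order[i + 1]) for i in positions])
        for i in positions:                # swap the members of each pair for the next step
            order[i], order[i + 1] = order[i + 1], order[i]
    return sets
```

For n = 8 the first three entries reproduce the pairs of the three sets given in the text (the unpaired indices 2 and 7 of the second set are simply left out of the output), and after 8 steps each of the 28 possible pairs has appeared exactly once.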

This data movement brings the next blocks of size s x s to be zeroed to the
subdiagonal, and the process continues similarly to the operations performed in the
first step. However, in this case the elements to be nullified are in square blocks
of size s x s inside diagonal blocks of size 2s x 2s. This data movement will imply
data transfers in the parallel version of the algorithm.

The cost per sweep is:

$8k_3 n^3 + (12k_1 - 16k_3)\, n^2 s + 8k_3\, n s^2$ flops, (1)

where $k_1$ and $k_3$ represent the cost of an arithmetic operation performed using
BLAS 1 or BLAS 3, respectively.


2.1  A  parallel  algorithm 

It is possible to obtain a balanced algorithm for a ring. Grouping blocks of size
2s x 2s of matrices A and V into bigger blocks $A_{ij}$ and $V_{ij}$ of size 2sk x 2sk, we
assign to each processor $P_i$, with $p = q/2$ and $n/(2sk) = q$, rows of blocks $i$ and
$q - 1 - i$ of matrices A and V. Therefore, each processor $P_i$ contains blocks $A_{ij}$
and $A_{q-1-i,j}$, and $V_{ij}$ and $V_{q-1-i,j}$, with $0 \le j < q$.

Due to the data movement between odd and even steps, it is necessary to
reserve some additional memory, and (2sk + s)(2n + 2sk + 2s) positions of memory
are reserved on each processor.

The arithmetic cost per sweep when computing eigenvalues and eigenvectors is:

$8k_3 \frac{n^3}{p} + (12k_1 - 8k_3) \frac{n^2 s}{p} + 12k_1 \frac{n s^2}{p}$ flops. (2)

And  the  cost  per  sweep  of  the  communications  is: 

/?(p  +  3)  j  +  r  |^8n' +  2ns  -  ,  (3) 

where $\beta$ and $\tau$ represent the start-up and the word-sending time, respectively.

3  A  one-sided  block  Jacobi  algorithm.  First  version 

We analyze the first one-sided Jacobi algorithm of [1]. The algorithm
works on matrices $B_0 = A$ and $W_0 = I$, obtaining $B_{r+1} = V_r B_r$ and $W_{r+1} = V_r W_r$,
with $V_r$ the rotation matrix nullifying a non-diagonal element of matrix

$A_r = B_r W_r^T = V_{r-1} V_{r-2} \cdots V_0 B_0 W_0^T V_0^T \cdots V_{r-1}^T$.

To nullify $a_{ij}$ it is necessary to compute $a_{ii}$, $a_{jj}$ and $a_{ij}$, because the
algorithm works on matrices $B_r$ and $W_r$, and not on matrix $A_r$. These elements
are obtained with three dot products. After that, rows i and j of $B_r$ and $W_r$
are updated. If the diagonal elements are stored in an auxiliary vector, it is not
necessary to compute $a_{ii}$ and $a_{jj}$ every time, and the cost per sweep is:

_  3  15  rj  Tl 

7”  “  y"  +  2 
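
To make the non-block one-sided scheme concrete, here is a minimal Python sketch under stated assumptions: the function names are hypothetical, the three elements are recomputed with dot products on every rotation (the auxiliary vector of diagonal elements mentioned above is not used), the rotation angle is chosen in [-pi/4, pi/4], and a fixed number of cyclic sweeps replaces a convergence test.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def rotation(a_ii, a_jj, a_ij):
    """(c, s) of the plane rotation that zeroes a_ij, with |angle| <= pi/4."""
    d = a_jj - a_ii
    if d >= 0:
        theta = 0.5 * math.atan2(2.0 * a_ij, d)
    else:
        theta = 0.5 * math.atan2(-2.0 * a_ij, -d)
    return math.cos(theta), math.sin(theta)


def one_sided_jacobi(A, sweeps=10):
    """Return (B, W) such that B * W^T is (nearly) diagonal for symmetric A."""
    n = len(A)
    B = [[float(x) for x in row] for row in A]                          # B_0 = A
    W = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # W_0 = I
    for _ in range(sweeps):
        for i in range(n - 1):
            for j in range(i + 1, n):
                # Elements of A_r = B_r W_r^T obtained with three dot products.
                a_ii, a_jj, a_ij = dot(B[i], W[i]), dot(B[j], W[j]), dot(B[i], W[j])
                c, s = rotation(a_ii, a_jj, a_ij)
                for M in (B, W):                                        # update rows i and j only
                    ri, rj = M[i], M[j]
                    M[i] = [c * x - s * y for x, y in zip(ri, rj)]
                    M[j] = [s * x + c * y for x, y in zip(ri, rj)]
    return B, W


def eigenvalues(B, W):
    return sorted(dot(b, w) for b, w in zip(B, W))
```

For the 2 x 2 matrix [[2, 1], [1, 2]] a single rotation already diagonalizes $B W^T$, and `eigenvalues` returns values close to 1 and 3.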

We propose a one-sided block Jacobi algorithm by combining the ideas of the
two-sided block algorithm and the ideas of the one-sided algorithm.

Matrices B and W, of size n x n, are divided into blocks of size s x n, and
blocks of $A = BW^T$ are treated using the odd-even ordering.

Initially the n/(2s) blocks corresponding to the first Jacobi set are treated,
making a two-sided sweep on blocks of size 2s x 2s of matrix A and accumulating
rotations. These operations are done using BLAS 1.

After that, matrices B and W are updated by multiplying the rotation matrices,
of size 2s x 2s, by the corresponding blocks of B and W, of size 2s x n. In this
case matrices B and W are not symmetric.

In the two-sided algorithm a movement of rows and columns of blocks is
performed in order to have the blocks grouped according to the next Jacobi set.
This movement can be included in the updating of the matrix if it is done on the
rotation matrix before updating A. In the one-sided algorithm the movement of
rows of blocks of B and W can be done in the same way (figure 1).

In successive steps it is necessary to compute $A_{ii}$, $A_{jj}$ and $A_{ij}$, because the
work is not done directly with matrix A. If we call $B_i$ and $W_i$ the i-th row of
blocks of B and W in figure 1.a), then $A_{ii} = B_i W_i^T$, $A_{jj} = B_j W_j^T$ and $A_{ij} = B_i W_j^T$.
If the diagonal blocks are stored it is not necessary to compute $A_{ii}$ and $A_{jj}$.

After the blocks $A_{ii}$, $A_{jj}$ and $A_{ij}$ are computed, a matrix of size 2s x 2s is
formed, and a two-sided sweep is performed on this matrix, accumulating the
rotations.

Fig. 1. Distribution of matrices B and W in the first one-sided block algorithm: a)
initially, b) after application of the first set of rotations.


Fig. 2. Initial distribution of matrices B, D and W in the system of processors for the
first one-sided parallel block algorithm.


The cost per sweep is:

$9k_3 n^3 + (12k_1 - 9k_3)\, n^2 s$ flops. (5)

3.1  A  parallel  algorithm 

It is possible to assign to each processor k consecutive blocks of size 2s x n, with
n = 2skp, of matrices B and W (figure 2). The figure shows the distribution of the
matrices, but in this case too it is necessary to reserve some additional
memory to store data in successive steps of the algorithm. The quantity of memory
reserved in each processor is (2k + 1)sn to store elements of B, the same quantity
to store elements of W, and $(2k + 1)s^2$ to store elements of D.

The arithmetic cost per sweep is:

$9k_3 \frac{n^3}{p} + 12k_1 \frac{n^2 s}{p} + 12k_1 \frac{n s^2}{p}$ flops. (6)

It is not necessary to broadcast the rotation matrices because each processor
updates the rows of blocks it contains. The only communications are
those between steps to group data according to the next Jacobi set. In odd steps,
blocks of size s x n of B and W, and a diagonal block of size s x s, are sent from
$P_i$ to $P_{i-1}$, with $i = 1, 2, \ldots, p - 1$, and in even steps the same communications
are done from $P_{i-1}$ to $P_i$. Therefore, the cost per sweep of communications is:

$2 \frac{n}{s} \beta + (4n^2 + 2ns)\, \tau$. (7)

Fig. 3. Initial distribution of matrices A and D in the system of processors for the
second one-sided parallel block algorithm.


4  A  one-sided  Jacobi  algorithm.  Second  version 

The second one-sided Jacobi algorithm proposed in [1] has the advantage of a
lower execution time, but also has the disadvantage of a worse precision [4].

The method works by diagonalizing matrix $B = AA^T$, but without explicitly
forming B. Rotations V nullifying elements $b_{ij}$ of B are applied to A. If initially
$A_1 = A$ and $B_1 = A_1 A_1^T$, we will have $A_{r+1} = V_r A_r$, and A must be updated
only on one side. Because $B_r = A_r A_r^T$, it is necessary to perform dot products
to obtain $b_{ii}$, $b_{jj}$ and $b_{ij}$, which are needed to obtain the next Jacobi rotation.

The cost per sweep is approximately $4n^3$ flops if the elements of the diagonal
are stored.

The method has some problems derived from the fact that the eigenvalues
computed are those of $A^2$, not the eigenvalues of A [1, 4].
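
The remark about $A^2$ can be seen on a tiny example. The Python sketch below is an illustration only (hypothetical function names, fixed number of sweeps, no stored diagonal): it orthogonalizes the rows of A with plane rotations, so that the squared row norms converge to the eigenvalues of $AA^T$, which equals $A^2$ for symmetric A.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def rotation(b_ii, b_jj, b_ij):
    """(c, s) of the plane rotation that zeroes b_ij, with |angle| <= pi/4."""
    d = b_jj - b_ii
    if d >= 0:
        theta = 0.5 * math.atan2(2.0 * b_ij, d)
    else:
        theta = 0.5 * math.atan2(-2.0 * b_ij, -d)
    return math.cos(theta), math.sin(theta)


def squared_spectrum(A, sweeps=10):
    """Rotate rows of A until they are orthogonal; return the sorted squared row norms."""
    R = [[float(x) for x in row] for row in A]
    n = len(R)
    for _ in range(sweeps):
        for i in range(n - 1):
            for j in range(i + 1, n):
                # b_ii, b_jj, b_ij are elements of B = A A^T, never formed explicitly.
                b_ii, b_jj, b_ij = dot(R[i], R[i]), dot(R[j], R[j]), dot(R[i], R[j])
                c, s = rotation(b_ii, b_jj, b_ij)
                ri, rj = R[i], R[j]
                R[i] = [c * x - s * y for x, y in zip(ri, rj)]
                R[j] = [s * x + c * y for x, y in zip(ri, rj)]
    return sorted(dot(r, r) for r in R)
```

The matrices [[2, 1], [1, 2]] (eigenvalues 1 and 3) and [[1, 2], [2, 1]] (eigenvalues -1 and 3) both yield values close to 1 and 9, so the sign of an eigenvalue cannot be read off the result without extra work.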

To design an algorithm by blocks, matrix A is divided into consecutive blocks
of size s x n.

Before each subsweep on a block, $B_{ii}$, $B_{jj}$ and $B_{ij}$ are computed (or only $B_{ij}$
if the diagonal blocks are stored). Even if the diagonal blocks are stored, in the
first step all the blocks must be computed, because the algorithm works with A
and not with B.

The cost per sweep is:

+  (12ki  —  dks)  n^s  flops.  (8) 


4.1  A  parallel  algorithm 

The distribution of matrix A and matrix D, where the diagonal blocks are stored,
can be that shown in figure 3. In this case too it is necessary to reserve some
additional memory. The size of memory reserved on each processor to store data
from matrix A is (2k + 1)sn and to store data from matrix D is $(2k + 1)s^2$.

The arithmetic cost per sweep is:

$5k_3 \frac{n^3}{p} + 12k_1 \frac{n^2 s}{p} + 12k_1 \frac{n s^2}{p}$ flops. (9)

Table 1. Predicted execution time of different parallel block Jacobi algorithms.

n = 1024               p = 1   p = 2   p = 4   p = 8   p = 16  p = 32  p = 64  p = 128
two-sided              412.3   217.5   108.9   54.6    27.5    15.9    9.4     -
one-sided, version 1   463.8   246.5   123.6   62.1    31.3    16.0    9.3     5.2
one-sided, version 2   257.6   143.1   71.7    36.0    18.1    9.2     5.2     2.9

The only communications are those produced by the data movements between
steps. In odd steps s(n + s) elements are sent from $P_i$ to $P_{i-1}$, and in even steps
the same quantity is sent from $P_i$ to $P_{i+1}$. The cost per sweep of communications
is:

$\frac{2n}{s} \beta + (2n^2 + 2ns)\, \tau$.


5  Comparison  and  Conclusions 

The first version of the one-sided algorithm has a higher cost than the two-sided
method, but the difference is smaller in the algorithms working by blocks than
in the algorithms not working by blocks. The second one-sided algorithm has the
lowest cost, but has worse precision. Communications are less costly in the
one-sided algorithms because it is not necessary to broadcast the rotation matrices.
The second one-sided algorithm is also better in communications, because
it works with one matrix and only half of the data must be transferred. Table
1 shows the execution time predicted on the Touchstone Delta for matrix size
1024 and a variable number of processors. The estimated values of the constants
are [2]: $k_1 = 0.137\,\mu s$, $k_3 = 0.048\,\mu s$, $\beta = 61\,\mu s$ and $\tau = 0.149\,\mu s$. We can see that the
behaviour of the one-sided algorithms improves, relative to the two-sided algorithm,
when the number of processors increases. For this reason it would be interesting to
implement the algorithms analyzed here and to compare them experimentally, which
is what we are doing now.
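
As a quick arithmetic check of the p = 1 column of Table 1, the leading $n^3$ terms of the sequential costs can be evaluated directly. The sketch below is an assumption-laden illustration: it takes the leading coefficients to be 8, 9 and 5 (this sketch's reading of the cost expressions), uses $k_3 = 0.048\,\mu s$, and ignores every term that depends on the block size s, whose value is not given in this section.

```python
N = 1024
K3 = 0.048e-6  # seconds per flop performed with BLAS 3, as quoted in the text


def leading_term_seconds(coefficient, n=N, k3=K3):
    """Time of the dominant term: coefficient * k3 * n^3."""
    return coefficient * k3 * n ** 3


# Assumed leading coefficients of the per-sweep costs.
PREDICTED = {
    "two-sided": leading_term_seconds(8),
    "one-sided, version 1": leading_term_seconds(9),
    "one-sided, version 2": leading_term_seconds(5),
}
```

The three values are approximately 412.3, 463.9 and 257.7 seconds, which agree with the tabulated 412.3, 463.8 and 257.6 to within 0.1, suggesting that the p = 1 entries are dominated by (or computed from) the $n^3$ term.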

Abstract. In this paper we study the performance of the parallelization
of iterative methods on shared memory multiprocessors using
different data distributions. We start with the study of block and cyclic
distributions, and then propose a mixed distribution which combines the
advantages of both.


Keywords:  Conjugate  Gradient,  Data  Distribution,  Distributed  Shared 
Memory  Systems,  Sparse  Systems. 

1  Introduction 

This work tackles the parallelisation of the non-stationary iterative Conjugate
Gradient method [1,6], which is used to solve sparse linear equation systems.
This type of operation frequently appears during the resolution of partial differential
equations, and one of its characteristics is that the matrix of coefficients
must be symmetric and positive definite. The results obtained can be generalised
to other iterative methods, because all of them use the same kind of
computations.

The system on which the parallelisation of the algorithm was implemented
was the distributed shared memory multiprocessor Origin 2000 by Silicon Graphics,
which consists of 8 MIPS R10000 processors using a directory-based hardware
cache coherence protocol [4].

2  Data  distributions 

We used the data-parallel programming paradigm which, although easy
to program, makes optimisations highly complex to establish. The
programming language used was Fortran 77. The parallelization is expressed by
means of parallelization directives [5], which direct the compiler in the generation
of calls to the low level libraries in the multiprocessor. The elements of a vector
can be allocated in the memory of the system using two distributions: block and
cyclic. By means of a block distribution, the elements of a vector of size N are
divided into P blocks of size B = N/P (where P is the number of threads). In
a cyclic distribution, the elements are divided into pieces of size L (in our case
L = 1), and then they are distributed cyclically over the threads.
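
The two distributions amount to two index-to-thread maps. A minimal Python sketch follows; the function names are hypothetical, and the block size is rounded up when P does not divide N, which is an assumption the text does not discuss.

```python
def block_owner(i, n, p):
    """Thread that owns element i under a block distribution of n elements over p threads."""
    block = -(-n // p)          # ceil(n / p), so the last block may be shorter
    return i // block


def cyclic_owner(i, p, piece=1):
    """Thread that owns element i under a cyclic distribution with pieces of size `piece`."""
    return (i // piece) % p
```

For N = 8 and P = 4, the block map gives owners 0, 0, 1, 1, 2, 2, 3, 3 and the cyclic map with L = 1 gives 0, 1, 2, 3, 0, 1, 2, 3.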

On the Origin 2000 two techniques can be used to carry out these distributions:
regular and reshaped. In the regular scheme the units to be distributed
have to be pages of 16 KB. When data that must be allocated in different
memories lie in the same page, the compiler is not able to resolve
the conflict, and places the whole page in one of the memories. This causes
a great number of false sharing conflicts, especially for cyclic distributions,
because at the cache line level consecutive elements belong to different threads.
This is reflected in a strong increase in the number of coherency operations,
such as invalidations or exclusive-to-shared transitions in cache lines.


Fig.  1.  Reshaped  scheme  over  a  vector. 

In our case we use the reshaped scheme illustrated in figure 1, with which
the compiler can reorganize the size of the blocks in the storage structure of
the memory to obtain the desired distribution. This is achieved by storing
consecutively the array elements that correspond to each local memory. Using
the reshaped scheme, both the cyclic and block distributions obtain similar values
in the number of coherency operations.
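
The effect of reshaping on cache lines can be illustrated with a toy model in Python. Everything here is an assumption made for illustration: a cyclic distribution with L = 1, a "cache line" holding a fixed number of array elements, and hypothetical function names. It does not model the actual compiler or memory system of the Origin 2000.

```python
def reshaped_order(n, p, piece=1):
    """Storage order after reshaping: each thread's elements are stored consecutively."""
    return [i for t in range(p) for i in range(n) if (i // piece) % p == t]


def owners_per_line(storage, p, line, piece=1):
    """For each cache line of `line` elements, the set of threads owning its elements."""
    return [{(i // piece) % p for i in storage[k:k + line]}
            for k in range(0, len(storage), line)]
```

With n = 8, p = 4 and two elements per line, the regular layout `list(range(8))` puts two different owners in every line, whereas the reshaped order [0, 4, 1, 5, 2, 6, 3, 7] leaves a single owner per line.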


3  Computations 

The parallelization of the algorithm is based on the two types of operations which
represent the greatest computation costs: sparse matrix-vector products and
vector operations [3].

The sparse matrix-vector product is carried out by accessing the matrix by
columns, which is the same as reading its rows, as the matrix is symmetric. In
this way it is possible to parallelize the product so that each processor computes
the value of different elements of the resulting vector, thus eliminating possible
write conflicts. The format for accessing the matrix is
Compressed Column Storage, by means of which the matrix is characterised by just three
vectors [1].
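
A sequential Python sketch of this product is given below. It assumes that both triangles of the symmetric matrix are stored, and the names `col_ptr`, `row_idx` and `values` for the three vectors are hypothetical; each iteration of the outer loop writes a single element of y, so the iterations could be shared among threads without write conflicts.

```python
def symmetric_ccs_matvec(col_ptr, row_idx, values, x):
    """y = A x for a symmetric A in Compressed Column Storage (full matrix stored)."""
    n = len(col_ptr) - 1
    y = [0.0] * n
    for j in range(n):                            # independent iterations: one y[j] each
        acc = 0.0
        for k in range(col_ptr[j], col_ptr[j + 1]):
            acc += values[k] * x[row_idx[k]]      # column j read as row j, since A = A^T
        y[j] = acc
    return y
```

For the matrix [[4, 1, 0], [1, 3, 2], [0, 2, 5]] and x = (1, 2, 3), the three vectors are `col_ptr = [0, 2, 5, 7]`, `row_idx = [0, 1, 0, 1, 2, 1, 2]` and `values = [4, 1, 1, 3, 2, 2, 5]`, and the product is (6, 13, 19).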

The algorithm also uses some vectors to carry out various intermediate operations:
computing the residual, the successive approximations to the solution
and the search direction. With these vectors two types of operations are carried
out: linear combinations of vectors and dot products. These operations have
great influence over the efficiency of the parallel program. Moreover, their performance
does not depend on the type of distribution, due to the use of the reshape
technique.
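
For reference, the way these vector operations combine in the Conjugate Gradient iteration can be sketched as follows. This is the textbook unpreconditioned method written in plain Python with a dense matrix for brevity; it is not the authors' Fortran code, and the helper names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def axpy(alpha, x, y):
    """Linear combination alpha * x + y."""
    return [alpha * a + b for a, b in zip(x, y)]


def matvec(A, x):
    return [dot(row, x) for row in A]


def conjugate_gradient(A, b, tol=1e-24, max_iter=100):
    """Solve A x = b for symmetric positive definite A; tol bounds the squared residual norm."""
    x = [0.0] * len(b)
    r = list(b)                      # residual b - A x for x = 0
    d = list(r)                      # search direction
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr <= tol:
            break
        q = matvec(A, d)             # the matrix-vector product of the iteration
        alpha = rr / dot(d, q)
        x = axpy(alpha, d, x)        # new approximation to the solution
        r = axpy(-alpha, q, r)       # new residual
        rr_new = dot(r, r)
        d = axpy(rr_new / rr, d, r)  # new search direction
        rr = rr_new
    return x
```

Each iteration performs one matrix-vector product, two dot products and three linear combinations of vectors, which is why these kernels dominate the cost.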


4  Results 


Initially two different distributions were evaluated: the block and the cyclic,
applied to all of the vectors, including those which characterise the matrix. Then
a new distribution was tested, which we will call hybrid block-cyclic, in which
the vectors that characterise the sparse matrix are distributed in blocks, and the
rest of them cyclically.


Table 1. Matrices used as benchmark.

matrix   bcsstk14   bcsstk17   zenios   random
order    1806       10974      2873     10000
nnz      63454      428650     21842    110576

In our analysis we used sparse matrices of different sizes and patterns. All
of these come from the Harwell-Boeing collection [2], except for one matrix
generated randomly. Their characteristics are shown in table 1.


Fig. 2. Speedup and run time per iteration (panels: bcsstk14 and bcsstk17).

In figure 2 the speedups and run times per iteration are shown. Note that the
best results have been obtained for the block distribution, whereas in the cyclic
case there are irregularities in these values due to an increase in the number
of cache misses. These irregularities are eliminated with the use of the hybrid
distribution.

The performance of the parallelization of this algorithm depends mainly
on the management of the memory, an important factor being the volume of
data accessed by each processor. The pattern of access to the data in the sparse
matrix-vector product in each processor is shown in figure 3, and is determined
by the distribution of vector Y. In other words, each processor accesses those
parts of the matrix which correspond to its elements of vector Y.


Fig. 3. Pattern of access in sparse matrix-vector product: (a) block, (b) cyclic.

Note that in the block distribution the access to the matrix is performed on
adjacent columns. In this way, as the number of processors increases, each one
must access a smaller number of pages and cache lines. In the case of a cyclic
distribution of vector Y, it is necessary to access practically all the pages of the
matrix. This is reflected in the large number of TLB misses. For matrix bcsstk17
the number of pages it occupies is greater than the number of TLB entries, so that
on a single processor a great number of misses is generated, as it has to access
the whole matrix. When the number of processors increases, in the block case the
number of TLB misses decreases, whilst in the case of the cyclic distribution it
remains almost the same. However, the main problem with the latter distribution
is that the consecutive elements of the matrix belong to different cache lines, so
that the number of lines accessed by each processor is much larger than in the
block case.

Figure 4 shows the number of cache misses. The block distribution gives the lowest
number of cache misses. With a cyclic distribution a marked increase
in the number of cache misses can be observed in the case of four processors.
The reason is that in this case all the lines read by each processor do not fit in
its cache, thereby producing a large number of replacement operations. In the
next iteration these replaced lines are demanded again, thus provoking capacity
misses in the cache. By means of a hybrid distribution it is possible to solve
this problem to a great extent, as consecutive elements in the matrix will be
consecutive in memory (except if they are assigned to different processors) and
therefore will probably belong to the same cache line. In this way the number of
cache lines accessed is reduced and the replacement problems
are eliminated. However, these values will always be larger than in the
case of block distributions, given that it is again necessary to access a greater
memory space than in the block case, due to the cyclic distribution of the vector.
The higher the number of nonzero elements, the greater this effect will be. Thus,
in the case of the matrix zenios, as it has a small number of nonzero elements,
the results achieved in the run time per iteration are similar for all distributions.
The size of matrices bcsstk17 and random is greater than that of the secondary
cache, and this produces a large number of capacity misses when the algorithm
is executed on a single processor. This also produces superlinear speedups for the
matrix bcsstk17.


Fig.  4.  Cache  misses. 


The main problem of the block distribution arises when matrices with non-uniform
patterns, such as zenios, are used, as they cause a load unbalance between the
processors which operate over the densest parts and those that operate
over the sparsest parts. This can be noted in figure 5, which represents the load
unbalance given by $B = C_{max}/C_{med}$, where $C_{max}$ is the number of floating point
operations of the thread which has the greatest work load, and $C_{med}$ is the average
value. High values of B limit the value of the speedup when a high number of
processors is used. By means of a cyclic distribution, the problem of
load unbalance is solved. Note that the speedup for the zenios matrix is more
scalable for the hybrid distribution.
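
The load unbalance measure can be computed directly from per-thread operation counts. The Python sketch below uses a made-up column profile purely for illustration (it is not data from the paper), counts two floating point operations per stored element as an assumption, and uses hypothetical function names.

```python
def load_unbalance(work_per_thread):
    """B = C_max / C_med: largest per-thread work divided by the average work."""
    c_max = max(work_per_thread)
    c_med = sum(work_per_thread) / len(work_per_thread)
    return c_max / c_med


def work_per_thread(nnz_per_column, p, distribution):
    """Floating point operations per thread when columns are dealt out block-wise or cyclically."""
    n = len(nnz_per_column)
    block = -(-n // p)                       # ceil(n / p)
    work = [0] * p
    for j, nnz in enumerate(nnz_per_column):
        owner = j // block if distribution == "block" else j % p
        work[owner] += 2 * nnz               # assumed: one multiply and one add per element
    return work


# Hypothetical non-uniform pattern: two dense columns followed by six sparse ones.
PROFILE = [8, 8, 1, 1, 1, 1, 1, 1]
```

With four threads, the block distribution of `PROFILE` gives B = 32/11 (about 2.9), while the cyclic distribution gives B = 18/11 (about 1.6); a perfectly uniform pattern gives B = 1 for both.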


Fig. 5. Load unbalance (panel title: HYBRID - RESHAPE).

5 Conclusions

The use of regular distributions is inefficient in sparse systems, given that in
these cases the pattern of the matrices is not known at compile time. By using
a hybrid distribution the load balancing advantages of the
cyclic distribution are maintained, and the execution times per iteration are
similar to those of the block distribution. In this way, the results for matrices with
regular patterns are similar to those of the block distribution, and for matrices
with irregular patterns the load unbalance is resolved.

Abstract. We present an approach for designing synchronized parallel
algorithms to update RedBlack trees. The resulting algorithms update k
keys with k processors on trees of size n in time O(log n + log k), which
is very close to the optimal speedup of O(log n) (sequential time for one
search or update). The algorithms are designed as a pipeline of waves
of processors, which are created at the bottom of the tree and flow up
to the root. The design is made following the E. W. Dijkstra approach by
first choosing the invariant properties and then the rules to update the
tree.

Keywords: Synchronized parallel algorithms, PRAM algorithms, RedBlack trees.


1  Introduction 

The so-called synchronized parallel algorithms are those that manage data types
in a synchronized manner (PRAM algorithms [Akl89]). They can be envisaged
as many sequential algorithms running simultaneously and executing the same
statement at the same time. Therefore, it may happen that several processes read
or write the same memory location at the same time. Our goal is to avoid
these concurrent accesses.

The first synchronized parallel algorithms on search trees were designed by
W. Paul, U. Vishkin and H. Wagener for 2-3 trees in 1983 [PVW83]. They proved
that the time needed to search or update k elements with k processors on a tree
with n keys is O(log n + log k), which is very close to the optimal speedup of
O(log n).

They designed parallel algorithms to dynamically maintain a parallel dictionary
working simultaneously with many keys. The algorithms first hang the
keys from the leaves (search phase), and later rebalance the tree (rebalancing
phase) using pipelines of processors. These pipelines can be envisaged intuitively

*  This  work  has  been  partially  supported  by  ESPRIT  LTR  Project  no.  20244  — 
ALCOM-IT  and  DGICYT  under  grant  PB95-0787  (project  KOALA). 
in terms of traveling plane waves. Assume, for instance, the basic insertion case
in which every leaf incorporates at most one new key. Something like a wave of
processors is generated at the bottom of the tree, namely a plane wave, because
all leaves of a 2-3 tree have the same depth. This wave is sent up in further
iterations until it disappears. Note that the wave goes to the root and at each
iteration it strictly increases its height and decreases its depth. The life-time of
each wave, i.e. the number of steps taken by a wave before it disappears, is an
open problem, but some preliminary results [BYGM97] strongly suggest that it
is logarithmic in k.

In the general insertion case, in which a packet of many new keys can hang
from a single leaf, a pipeline of waves is generated to get something like harmonic
traveling waves. Each new wave is created as follows: some iterations after the
last wave has been created, the packets are split, the middle key of each one
is attached as a new leaf and the remaining left and right subpackets are hung
from the new leaf. The set of new leaves created by the middle keys constitutes
the new wave.
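
The splitting of packets into successive waves can be illustrated with a small Python sketch. It is a simplification made under stated assumptions: a packet is a sorted list of keys, the middle key is the one at index len // 2, all function names are hypothetical, and the tree itself, the rebalancing rules and the delay between waves are left out.

```python
def split_packet(packet):
    """Split a sorted packet into (left subpacket, middle key, right subpacket)."""
    mid = len(packet) // 2
    return packet[:mid], packet[mid], packet[mid + 1:]


def waves_from_packet(packet):
    """Keys that become new leaves in each successive wave, for a single packet."""
    waves = []
    pending = [list(packet)]
    while pending:
        wave, next_pending = [], []
        for p in pending:
            left, key, right = split_packet(p)
            wave.append(key)                                  # middle key becomes a new leaf
            next_pending.extend(q for q in (left, right) if q)  # subpackets hang from it
        waves.append(wave)
        pending = next_pending
    return waves
```

A packet with the seven keys 1 to 7 produces three waves, [4], then [2, 6], then [1, 3, 5, 7]; in this simplified setting the number of waves grows with the logarithm of the packet size.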

This rebalancing phase synchronizes the processors that belong to the same
wave, and these processors locally manage the data and test the conditions to
become inactive or to continue one more step. For this reason we say that
processors are controlled by Local Rules. These are sequential algorithms composed
of a small and fixed number of statements that access a small number of neighbor
nodes. The rebalancing phase can be written:

While there are active processors do
    For all waves do
        For all active processors of a wave do
            Select and apply rules
        endforall
    endforall
endwhile
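A minimal sequential simulation of this loop structure might look as follows. It is a sketch only: the inner loops that run in parallel on the real machine are executed one after another here, and `apply_rule` and `toy_rule` are hypothetical stand-ins, not the local rules of the algorithm.

```python
def rebalance(waves, apply_rule):
    """Run the wave-synchronous loop until no processor is active.

    waves: list of sets of active nodes, one set per wave.
    apply_rule: hypothetical local rule; given an active node it returns
        the node that is active next, or None if the processor stops.
    Returns the number of iterations of the outer while loop.
    """
    iterations = 0
    while any(waves):                         # while there are active processors
        for i, wave in enumerate(waves):      # for all waves
            moved = (apply_rule(node) for node in wave)   # select and apply rules
            waves[i] = {n for n in moved if n is not None}
        iterations += 1
    return iterations


def toy_rule(depth):
    """Toy rule (an assumption): a node is represented by its depth, climbs
    two levels per step and becomes inactive when it would reach the root."""
    return depth - 2 if depth - 2 > 0 else None


print(rebalance([{6}, {4}, {2}], toy_rule))   # 3
```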


These ideas were applied on B trees by L. Higham and E. Schenk [HS94], on
Skip lists by J. Gabarro, C. Martinez and X. Messeguer [GMM96], and on AVL
trees by J. Gabarro and X. Messeguer [GM96].

The RedBlack trees are an important basic data structure, namely a balanced
binary search tree, which implements the dictionary abstract data type. The
balancing criterion differentiates RedBlack trees from 2-3 trees, because it does not
force the tree to be perfectly balanced: it is possible to deal with RedBlack trees
whose leaves have significantly different depth. Therefore, it could be difficult to
synchronize the processors of a wave because there is no obvious way to create
plane waves.

We present in this paper the design of the synchronized insertion parallel
algorithm on RedBlack trees with the same cost O(log n + log k), and the
exclusive-read and exclusive-write policy (EREW [Akl89]).

We omit the search phase of the update algorithms because it is well known
(see previous references). We only design the rebalancing phase of this algorithm
in which k keys are updated with k processors. The deletion algorithm can easily
be designed using the same technique.

We prove the algorithm correctness following the approach developed by
E. W. Dijkstra, in which the proofs are based on the preservation of some
properties, called invariants, at each iteration, and the strict decrease of a function,
called variant function, at each iteration. This approach, very common in basic
courses on sequential algorithms, has not yet been applied to parallel algorithms
on balanced search trees.

The rest of the paper is organized as follows. Section 2 recalls RedBlack trees.
Section 3 addresses the synchronized insertion algorithm. Finally, Section 4 shows
the local rules of the algorithm.

2  RedBlack  trees. 

Following [CLR90], each node n of a RedBlack tree stores a key, denoted key(n),
and each internal node has three pointers left(n), right(n) and parent(n)
pointing respectively to its sons and parent. A RedBlack tree satisfies the following
properties:

P1 : Every node is either red or black.

P2  :  Every  leaf  (NIL)  is  black. 

P3 : If a node is red then both its children are black. Equivalently, no
path from the root to a leaf contains two consecutive red nodes.

P4  :  Every  simple  path  from  a  node  to  a  leaf  contains  the  same  number  of  black 
nodes. 

The  last  condition  P4  allows  the  definition  of  the  function  called  black-height  in 
[CLR90]: 


blackh(n) = the number of black nodes on any path from,
            but not including, a node n to a leaf.


We  recall  the  sequential  insertion  algorithm: 

1.  Search  phase.  The  key  to  be  inserted  falls  until  it  is  attached  to  a  new  red 
node  n  at  the  bottom  of  the  tree.  As  this  new  node  n  is  red,  the  property 
P4  is  maintained. 

2.  Rebalancing phase. If the parent of n is black, P3 holds and the insertion
is over. Otherwise, n and parent(n) are red and the bottom-up rebalancing
phase really starts. By performing rotations and node recoloring, the redness
of consecutive nodes disappears or rises up. Finally, if the root becomes red
it is colored black. Figure 1 depicts the local rules applied in this phase.
Fig. 1. The three basic local rules (under symmetry) of the sequential algorithm. The
first rule (a) propagates up the redness and the following two rules, (b) and (c), rotate
down the blackness.
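For reference, the sequential insertion recalled above can be sketched along the lines of the textbook algorithm. This is a compact illustration, not the code of [CLR90]: `None` plays the role of the black NIL leaves, the class and method names are assumptions, and `check` is a helper added only to verify P3 and P4.

```python
RED, BLACK = "red", "black"


class Node:
    def __init__(self, key):
        self.key, self.color = key, RED        # new nodes are red, so P4 is kept
        self.left = self.right = self.parent = None


class RedBlackTree:
    def __init__(self):
        self.root = None

    def _rotate(self, x, left):
        """Rotate x down to the left (left=True) or to the right (left=False)."""
        if left:
            y = x.right
            x.right, inner = y.left, y.left
            y.left = x
        else:
            y = x.left
            x.left, inner = y.right, y.right
            y.right = x
        if inner is not None:
            inner.parent = x
        y.parent = x.parent
        if x.parent is None:
            self.root = y
        elif x is x.parent.left:
            x.parent.left = y
        else:
            x.parent.right = y
        x.parent = y

    def insert(self, key):
        # Search phase: the key falls to the bottom and becomes a red node n.
        parent, cur = None, self.root
        while cur is not None:
            parent, cur = cur, (cur.left if key < cur.key else cur.right)
        n = Node(key)
        n.parent = parent
        if parent is None:
            self.root = n
        elif key < parent.key:
            parent.left = n
        else:
            parent.right = n
        # Rebalancing phase: runs while n and parent(n) are both red.
        while n.parent is not None and n.parent.color == RED:
            p, g = n.parent, n.parent.parent   # g exists: a red p is never the root
            p_is_left = p is g.left
            uncle = g.right if p_is_left else g.left
            if uncle is not None and uncle.color == RED:
                # recoloring only: the redness rises to the grandparent
                p.color = uncle.color = BLACK
                g.color = RED
                n = g
            else:
                if (n is p.right) == p_is_left:     # inner grandson: rotate first
                    self._rotate(p, left=p_is_left)
                    n, p = p, n
                p.color, g.color = BLACK, RED       # rotation plus recoloring
                self._rotate(g, left=not p_is_left)
        self.root.color = BLACK                     # a red root is colored black


def check(node):
    """Return the black-height below node, asserting P3 and P4 on the way."""
    if node is None:
        return 0
    kids = (node.left, node.right)
    if node.color == RED:
        assert all(c is None or c.color == BLACK for c in kids)
    heights = [check(c) + (1 if c is None or c.color == BLACK else 0) for c in kids]
    assert heights[0] == heights[1]
    return heights[0]


tree = RedBlackTree()
for k in [5, 3, 8, 1, 4, 7, 9, 2, 6]:
    tree.insert(k)
    check(tree.root)        # the properties hold after every insertion
```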


3  Synchronized  parallel  insertion 

Assume that the parallel search phase has ended and that the packets of keys
hang from the leaves. We force each iteration of the rebalancing phase to hold
the following invariants:

I1: Properties P1, P2 and P4 of RedBlack trees.

I2: Only those red nodes whose parent is also red have an active processor. We
identify the node with its processor, so we sometimes talk about "active
nodes". Therefore, when there are no active nodes property P3 holds, and
by I1 the tree is a RedBlack tree.

I3: All active processors of a wave have the same black-height,

∀ p, q ∈ wave : blackh(p) = blackh(q).

This property allows us to define the black-height of a wave w:

blackh(w) = blackh(p) for any p such that p ∈ w.

I4: The black-height of the last created wave is at least two. This property
means that if the black-height of every wave gets increased by one unit at
each iteration, then between two consecutive waves there is at least one black
node. Therefore, if an active node has a grandparent gp, then gp is black.

The variant function involves the number of keys hanging at leaves, denoted
NKEYS, and the sum of the depths of all existing waves, denoted DEPTH.
(a) Two active nodes that are brothers (b) All other cases (e.g.: four active nodes)


Fig.  2.  The  two  basic  new  local  rules  (under  symmetry)  of  the  parallel  algorithm 


Namely, it is defined by the ordered pair (NKEYS, DEPTH). It strictly
decreases at each iteration because new keys are attached to the tree, and when
there are no keys, we force waves to strictly increase their black-height.
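The decrease of the ordered pair can be illustrated with a small sketch. It assumes that pairs are compared lexicographically (the usual reading of a variant given as an ordered pair, and the way Python compares tuples); the function name and all numeric values are made up for the example.

```python
def variant(pending_packets, wave_depths):
    """Variant function as the ordered pair (NKEYS, DEPTH)."""
    nkeys = sum(len(p) for p in pending_packets)   # keys still hanging at leaves
    depth = sum(wave_depths)                        # sum of the depths of all waves
    return (nkeys, depth)


# While keys remain, NKEYS drops at every iteration, so the pair decreases
# even if a newly created wave makes DEPTH larger.
before = variant([[1, 2, 3]], [5])
after_attach = variant([[1], [3]], [4, 6])
assert after_attach < before

# Once no keys are left, every wave moves up, so DEPTH alone decreases.
no_keys = variant([], [3, 5])
after_move = variant([], [2, 4])
assert after_move < no_keys
```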

Each iteration is composed of two separate actions: (i) the creation of a new
wave and (ii) the moving up of all waves.

(i) A wave is created by selecting the middle key of each packet and attaching
it into a new red node, so I1 holds. Each new red node n is controlled by
an active processor. Then active processors test their parent's color and
become inactive if it is black, so I2 holds. As all nodes of the last created
wave satisfy blackh(n) = 1 (black leaves hang from them), I3 holds.

(ii) Active processors run local rules which will be shown in the following
section. We design them so they satisfy the previous invariants and so
they increase the black-height of all waves. Finally, we again update the
active nodes.


4  Local  rules  for  insertion 

Let us deal now with the rules we apply to make the waves go up. If there are
active nodes without a grandparent, we simply turn the root black. For each
active node n with a grandparent (that is black, by I4) we consider the area
defined by its grandparent gp, and the sons and grandsons of gp. In this area we
can have active nodes other than n, but in any case they are all grandsons of gp
and belong to the same wave, by the invariants. Depending on the number of active
nodes in the area we apply one rule or another.

If the grandparent of an area has only one active grandson we are in the same
situation as the sequential case so we can try the same rules (see [CLR90]) and
check if they satisfy the invariants. If the grandparent has more than one active
grandson we are in a specifically parallel case so we need new rules. For every
area in this situation we need to select one representative of its active nodes so
we can apply the rules with only one processor. Note that counting the active
nodes in an area and selecting a representative may lead us to a concurrent read
situation. We avoid that possibility by properly sequentializing that process.
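One way to picture the counting and selection step is the sequential sketch below. It is only an assumption about how such a step could be organized: the function name, the dictionary `grandparent_of` and the choice of the smallest node as representative are all hypothetical.

```python
from collections import defaultdict


def pick_representatives(active_nodes, grandparent_of):
    """Group active nodes by area (their grandparent) and pick one per area.

    Returns {grandparent: (representative, number of active nodes in the area)}.
    """
    areas = defaultdict(list)
    for node in active_nodes:            # one node at a time: no concurrent reads
        areas[grandparent_of[node]].append(node)
    # the smallest node is an arbitrary but deterministic choice
    return {gp: (min(nodes), len(nodes)) for gp, nodes in areas.items()}


reps = pick_representatives(["a", "b", "c"], {"a": "g1", "b": "g1", "c": "g2"})
print(reps)   # {'g1': ('a', 2), 'g2': ('c', 1)}
```

An area with a count of 1 would then use the sequential rules, and an area with a larger count the new parallel rules.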

In the sequential case we have three rules (see Figure 1): in (a) we move the
wave up just by recoloring. Note that the number of black nodes of each path
does not change but the variant function decreases, because the black-height
of the wave (whose only node is now the grandparent) is one unit higher than
before. In (b) and (c) we need both rotations and recoloring. The number of
black nodes of each path does not change and the active nodes become inactive.

In the parallel case we find two new situations: if we have two active nodes
that are brothers (Figure 2(a)) we need one rotation and recoloring; otherwise
(Figure 2(b)) recoloring is enough, because both parents are red. Again the wave
moves up one level without changing the number of black nodes of any path.

Summing everything up, in all cases the active nodes of a wave move up one
level (their black-height increases one unit) or they become inactive, which means
that the variant function actually decreases and I3 holds. The last created wave
has now black height two (I4). We also guarantee that every path from every
node to a leaf has the same number of black nodes, so we preserve I1. Finally,
as we keep updating the active nodes, we also satisfy I2.
Abstract. In this paper we try to show that speeding up Geographical
Information Systems (GIS) by processing them on parallel architectures is
possible. A spatial data partitioning and subdivision scheme is proposed
to process GIS data on a distributed memory parallel machine. We also
provide solutions to classical problems in GIS and parallel
processing, such as data boundary matching, and how to distribute and
assign data among different processors to optimize both result quality
and communication time. Finally, we show results obtained with different
kinds of hardware platforms: a network of computers organized in a cluster,
and a massively parallel machine.


Key words: parallelization, data partitioning, Geographic Information
Systems (GIS), massively parallel processors (MPP), multicomputers

1  Introduction 

Geographic information is characterized by its distribution over the terrain surface.
This data organization makes its projection onto a horizontal plane a good
data model for recording and handling it. We must also realize that most processing
of these data is done considering parameters related to the terrain surface [6].

A great deal of GIS algorithms (visualization, data interpolation, DTM
generation from contour lines, intervisibility, shapes, planning, analysis, scheduling,
retrievals, etc.) do calculations on data representing a terrain characteristic, and
are therefore easily structured as information data layers.

Both retrieval and processing of these information layers are done on
a delimited area of the terrain surface, considering only spatially close data. This
data neighbourhood property allows them to be processed in parallel in a distributed
system without many communication requirements. Tasks may be distributed
following data partitions of the terrain surface, in such a way that each partition
is close to its own subset of processors, though not to the others. A good
data partitioning scheme among processors will allow parallel work with a certain
autonomy (distributed memory multiprocessor).




2  Geographic  Information  Systems  (GIS) 

Geographic Information Systems handle spatial information with a particular
behaviour. Geographic information includes cartographic or graphic elements,
but also alphanumeric attributes. The main conceptual models in GIS are the vector
and raster models. The vector format uses the line as its graphic primitive, while
the raster format uses the point.

A vector representation of geographical information uses points, lines,
polylines and polygons as geometrical primitives. Attributes are linked to the
geometry. In a raster representation, the information is projected onto a grid, each
grid-point defining the location and the attribute of the location.

Raster representation usually requires more memory but, on the other hand,
yields a more homogeneous spatial distribution. This format has been used more
than the vector one, because most algorithms applied to this kind
of data are more efficient with this format.

Consequently, a classic problem in GIS is the huge amount of memory required to
store geographical data, with all its consequences: high access times, memory
bandwidth saturation, concurrency problems, etc. These disadvantages would
be considerably reduced with several processors working in parallel, following a
distributed memory scheme [9].


3  GIS  algorithms  parallelization 

The presented solution is based on spatial parallelism, partitioning the data domain
into square or rectangular partitions. Every rectangular partition is assigned to
a virtual process node. This kind of data distribution is also named domain
decomposition [2]. Work to be done is assigned to the processor on which the data
reside, and that processor may communicate with its neighbours as necessary.

An optimal data partition will optimize communications among the different
virtual process nodes. But these processors must communicate with others when they
require data residing on them. In general purpose applications, this communication
overhead becomes a serious bottleneck.


3.1  Communication  requirements 

Communication requirements due to data partitioning in a distributed
Geographical Information System are:

-  Initial  data  partition  distribution 

-  Data  boundary  partitions  matching  problem 

-  Connectivity  and  neighbourhood  algorithms 

Some  geographic  data  analysis  may  be  done  in  parallel  on  different  data 
partitions  with  some  autonomy.  But  connectivity  and  neighbourhood  analysis 
evaluate  characteristics  over  an  area  that  may  cover  several  adjacent  partitions. 
Therefore,  processors  may  require  data  from  adjacent  nodes. 
Choosing the partition grain size is very important and not trivial. The
partitioning grain must allow a sufficient number of partitions to get the benefits of
parallelism, but partitions must also be big enough to provide a minimum autonomy
of work in case of connectivity or neighbourhood algorithms. Communications
will be limited to the subgroup of adjacent nodes.

We propose a parallelization scheme that reduces this communication and
solves the data boundary matching problem, trying to get the most benefit from
parallelism. The proposed scheme minimizes communications even with a fine
grain of parallelism. The solution is based on what we call the search area.

3.2  Search  Area 

The search area is a set of data surrounding every partition which is sent to a
processor to help calculations near partition boundaries. The data and processing of
the search area are actually assigned to another processor, and are read-only for the
partition they surround.

The search area is in fact an overlapped region replicated in several
processing nodes. But only one processor should write on it. A synchronization and
communication protocol is needed to guarantee data coherence and atomicity.

The spatial parallelization scheme includes search area management,
creation, and updating, as shown schematically in the following steps (fig. 3):

1.  Data assignment and distribution among virtual processing nodes.

2.  Search area creation.

3.  Local processing of the partitions in parallel, each processing node considering
its partition and its search area.

4.  Search area updating, considering results already obtained in adjacent nodes.

5.  Optional boundary data correction at each partition, considering the search
area already updated.
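Steps 1 to 3 above can be sketched on a one-dimensional domain as follows. This is a simplified illustration, not the implementation used in the tests: the domain is split into strips of rows, the search area is `sas` extra read-only rows on each side, and the neighbourhood operation `smooth` (a three-point mean) is a toy stand-in for interpolation. All function names are hypothetical, and the search area updating of steps 4 and 5 is not modelled.

```python
def partition_rows(n_rows, n_parts, sas):
    """Split rows 0..n_rows-1 into n_parts strips, each with a search area.

    Returns (own, halo) pairs of half-open row ranges: `own` is the strip a
    node may write, `halo` is `own` extended by up to `sas` read-only rows
    on each side, clipped at the border of the domain.
    """
    base, extra = divmod(n_rows, n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        stop = start + base + (1 if i < extra else 0)
        halo = (max(0, start - sas), min(n_rows, stop + sas))
        parts.append(((start, stop), halo))
        start = stop
    return parts


def smooth(values):
    """Toy neighbourhood operation: mean of each value and its direct neighbours."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - 1): i + 2]
        out.append(sum(window) / len(window))
    return out


def partitioned_smooth(data, n_parts, sas):
    """Process every partition with its search area and keep only owned rows."""
    result = [None] * len(data)
    for (start, stop), (h0, h1) in partition_rows(len(data), n_parts, sas):
        local = smooth(data[h0:h1])                         # partition + search area
        result[start:stop] = local[start - h0: stop - h0]   # discard halo rows
    return result


data = [float(i * i) for i in range(10)]
assert partitioned_smooth(data, 3, sas=1) == smooth(data)   # matches sequential run
assert partitioned_smooth(data, 3, sas=0) != smooth(data)   # boundaries are wrong
```

In this toy case a search area of one row is enough because the operation only looks one row away; a wider neighbourhood would need a larger `sas`.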


Fig.  1.  Execution  time  in  T3E  with  increasing  data  sizes. 
4 Results

We have implemented our proposed parallelization scheme on different parallel
hardware platforms because we wanted to propose a general scheme,
independent of the hardware: a cluster with several computers, and a massively
parallel processor.

Our main goal with this implementation is not only to demonstrate that our
parallelization scheme works properly, but also to show that contiguity and
neighbourhood algorithms may also be parallelized without losing information and
without significant communication overhead. We have focused
our tests on spatial interpolation from contour lines, one of the most
representative neighbourhood algorithms in GIS.

We have also studied the influence on time and result quality of different
parameters of the partitioning scheme, such as: size and number of data partitions,
number of real processor nodes, search area size, etc.

We have tested the following two hardware platforms: a cluster with 4 RS6000
(programming model: PVM); and a massively parallel processor, the Cray T3E, with
up to 32 processors (programming model: shared variables (HPF)).

On both platforms we have analyzed both the quality of results and the execution
time.

Analyzing result quality, we studied the proper search area size and the
partitioning grain (partition size). As expected, we obtained the same result quality
on both hardware platforms.

We  established  that  the  search  area  size  depends  on  data  distribution.  For 
spatial  interpolation,  we  estimated  that  the  search  area  size  should  obey  the 
following  expression: 


1  _  psas  >  Q  g 

where  sas  is  the  search  area  size  expressed  in  number  of  data  rows,  and  P  is 
the  density  of  points  with  known  latitude  in  input  data. 

This minimum search area size guarantees result quality near
partition boundaries quite similar to that of sequential processing.

We also found that with this search area size, the partitioning grain can be fine
enough to get the benefits of parallelism. Therefore, the best partition size is
determined by the number of available real processors.

As for execution time, results were rather different on the two platforms. Tests
on the cluster show that processing time heavily depends on the network load.
This network may soon become a bottleneck, and the speed up with 4 processors is
not really spectacular for small files.

But our tests also show that speed up improves as data size grows (fig. 1).
This is important, as these systems (characterized by managing huge quantities
of data) are involving more and more data.

Execution times on the Cray MPP are considerably better, with a high speed up,
thanks to a higher number of available real processors and better, dedicated
communication links. However, because I/O on this machine was
sequential, the total speed up is strongly influenced by sequential I/O. Considering just
interpolation time, excluding I/O operations, the speed up for 16 processors was
near 11, which is good (fig. 2 and 3).
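To put the reported figure in context, a speed up near 11 on 16 processors corresponds to a parallel efficiency of roughly 0.69. The sketch below also shows, with an Amdahl-style estimate, how a sequential I/O share limits the total speed up. The I/O fraction used in the example is a made-up value for illustration, not a measurement from the tests.

```python
def efficiency(speedup, processors):
    """Parallel efficiency: speed up divided by the number of processors."""
    return speedup / processors


def total_speedup(compute_speedup, io_fraction):
    """Amdahl-style estimate of the overall speed up.

    io_fraction: share of the sequential run time spent in I/O that stays
        sequential (a hypothetical input, not taken from the paper).
    """
    return 1.0 / (io_fraction + (1.0 - io_fraction) / compute_speedup)


print(efficiency(11, 16))         # 0.6875
print(total_speedup(11, 0.2))     # illustrative only: assumes 20% sequential I/O
```

Whatever the compute speed up, the estimate can never exceed 1 / io_fraction, which is why sequential I/O dominates on the larger machine.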
Fig. 2. Interpolation times and speed up for Cray T3E.


5  Conclusions 


A general parallelization scheme based on data partitioning is presented. The
proposed scheme minimizes communication between process nodes, thanks to the
search area concept introduced, and therefore response times are considerably
reduced. The proposed scheme also presents solutions to classical problems in
GIS, such as data boundary matching in spatial subdivision, the influence of the
partitioning grain on the quality and time of results, and data assignment to the
different processor nodes. Execution times are significantly better on the massively
parallel machine, where communications are not the bottleneck (in the cluster
the network is a serious bottleneck). However, a parallel I/O file system is strongly
recommended when massively parallel processors are used.
Fig. 3. I/O, interpolation and search area updating times for Cray T3E.
Abstract. Current processors use special techniques to improve
performance such as pipelining and multiple instruction issue per cycle.

Using a real pipelined or superscalar computer to teach these concepts is
impractical, because these computers are designed to be
programmed in high-level languages.

Hence, we have implemented a superscalar processor emulator, where
most of the processor parameters can be defined by the student. Its
objective is to create a set of laboratory works allowing the student to
observe the execution evolution of his assembly program through the
different components of the computer, detecting the different kinds of
hazards and their impact on performance. Then, the student can
apply some software techniques to avoid them. Moreover, he can obtain
statistics about caches.

Keywords:  education,  pipeline,  superscalar,  cache  memory,  emulator. 


1  Introduction 

This  paper  presents  a  superscalar  processor  (MC88110)  emulator  that  we  have 
implemented  to  teach  classical  and  modern  Computer  Architecture  concepts  at 
the  Facultad  de  Informatica  of  the  U.P.M. 

The motivation which led us to develop a new emulator, instead of using
existing ones, is simple. We wanted an educational tool which could serve
for different practical works in which we could increase the complexity of
the concepts we want to cover. In a first stage, we want to use the emulator to
teach assembly programming and later to teach cache behavior, and pipeline and
superscalar computer concepts. Caches can be disabled in beginners' laboratory
works to avoid memory hierarchy concepts.

Although some educational emulators which could serve for our practical
works are available (spim, cl-spim, Dlx, DineroIII and SuperDlx), they are
oriented to specific purposes. We were also looking for a tool running on
conventional Unix stations and on personal computers with Linux.
Nowadays, the emulator is being used for laboratory works, to teach assembly
programming using a RISC approach, cache behavior, and pipeline and superscalar
computer concepts. It is available for Solaris, AIX and Linux operating systems.

The  emulator  has  an  embedded  debugger.  It  allows  the  user  to  control  the 
program  execution,  and  to  observe  the  state  of  the  different  components  of  the 
computer  at  every  clock  cycle.  The  user  can  set  breakpoints,  execute  the  whole 
program  or  just  a  cycle,  display  and  modify  registers  and  memory  contents,  and 
display  the  instructions  at  the  different  pipeline  stages  and  the  history  buffer 
contents. 

The emulator currently has a textual interface, although an X-window based
interface that will provide equivalent functionality is being finished.


2  Emulator  description 

The system emulates the functional units and behavior of the MC88110
processor. We chose the MC88110 because at the beginning of this project (1993)
this processor had recently appeared and there was good documentation about
it. It included the most interesting characteristics of superscalar processors like
out-of-order completion of instructions, branch prediction, a mix of in-order and
out-of-order issue, and used shelving for some instructions.

This superscalar processor can issue two instructions every clock cycle, a
suitable throughput for our purposes. Instructions are issued in the order in
which they appear in the program, but they can be finished out-of-order due to
the different latencies of the functional units. The processor also implements a partial
out-of-order issue model for branch and store instructions, which can be issued
even when their operands are not available.

Instructions  are  dispatched  to  ten  different  functional  units  that  work  in 
parallel,  although  the  two  graphics  units  have  not  been  emulated. 

The instruction pipeline is a conventional four-stage RISC pipeline:

-  Fetch. Two instructions are read together from the instruction cache.

-  Decode. The instructions previously read are decoded and their source
registers are read from the register file. The branch target address is computed
to perform static branch prediction.

-  Execution. If the operands and functional units are available, both
instructions are dispatched and executed. At this stage branch instructions compute
the branch condition while load and store instructions execute their memory
accesses.

-  Write back. The execution results are written into the register file.

Latency is defined to be one cycle for all stages except the execution stage,
where it depends on which functional unit is involved.

The evolution of instructions through pipeline stages can be displayed at
every machine cycle, marking explicitly those executed due to a branch prediction.
Instruction dispatching can be stalled due to structural, data or control
hazards. The sequencer dispatches instructions according to the order in which they
appear in the program, except for store and branch instructions. In these cases
their functional units have two reservation stations, preventing these instructions
from producing stalls in the pipeline due to data dependencies.

In order to diminish the overhead produced by structural hazards, most of the
functional units, except divide, are pipelined, and a register file with two write ports
has been implemented. Also, there are two caches, thus implementing a Harvard
architecture. Most of the cache parameters are also configurable: cache access
time, total and line sizes, organization policies and write policy.
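To illustrate how parameters such as total size and line size determine the statistics, here is a minimal sketch of a direct-mapped cache model. It is not the emulator's code: the class name, the direct-mapped organization, the sizes and the address trace are all assumptions chosen for the example.

```python
class DirectMappedCache:
    """Counts hits and misses for a direct-mapped cache (sizes in bytes)."""

    def __init__(self, total_size, line_size):
        self.line_size = line_size
        self.n_lines = total_size // line_size
        self.tags = [None] * self.n_lines      # one tag per cache line
        self.accesses = 0
        self.misses = 0

    def access(self, address):
        """Look up an address; return True on a hit, False on a miss."""
        block = address // self.line_size
        index, tag = block % self.n_lines, block // self.n_lines
        self.accesses += 1
        hit = self.tags[index] == tag
        if not hit:                            # miss: load the line
            self.misses += 1
            self.tags[index] = tag
        return hit

    def hit_ratio(self):
        """Hit ratio as a percentage of all accesses."""
        return 100.0 * (self.accesses - self.misses) / self.accesses


# Hypothetical trace: nine sequential 4-byte instruction fetches starting
# at address 48, with a 256-byte cache and 16-byte lines.
icache = DirectMappedCache(total_size=256, line_size=16)
for addr in range(48, 84, 4):
    icache.access(addr)
print(icache.accesses, icache.misses, round(icache.hit_ratio(), 1))   # 9 3 66.7
```

Changing `line_size` in this sketch shows the effect students can observe: longer lines turn more of the sequential fetches into hits.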

Both  the  actual  processor  and  the  emulated  one  include  the  scoreboarding 
mechanism  to  track  RAW  and  WAW  data  dependencies.  Recent  superscalar 
processors  include  hardware  mechanisms  to  eliminate  WAW  dependencies  by 
register  renaming.  The  inclusion  of  this  hardware  mechanism  makes  tracking 
of  program  execution  harder,  which  we  do  not  consider  appropriate  due  to  the 
academic  purpose  of  the  emulator.  We  deal  with  register  renaming  statically, 
that  is,  at  programming  time. 
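The dependency tracking can be pictured with the following sketch of a scoreboard that records registers with a pending write. It is a simplified model under stated assumptions: each instruction has a single destination register (so the register pair written by a double-word result is ignored), and the class and field names are hypothetical.

```python
from collections import namedtuple

# text: assembly text, dest: destination register, srcs: source registers
Instr = namedtuple("Instr", "text dest srcs")


class Scoreboard:
    """Tracks the registers that still have a write pending."""

    def __init__(self):
        self.pending = set()

    def hazards(self, instr):
        """Return the data dependencies that would stall this instruction."""
        found = set()
        if any(src in self.pending for src in instr.srcs):
            found.add("RAW")               # a source is not written yet
        if instr.dest in self.pending:
            found.add("WAW")               # an earlier write to dest is pending
        return found

    def issue(self, instr):
        self.pending.add(instr.dest)

    def complete(self, instr):
        self.pending.discard(instr.dest)


sb = Scoreboard()
load = Instr("ld r5, r1, r0", "r5", ("r1", "r0"))
sb.issue(load)
print(sb.hazards(Instr("mulu.d r9, r5, r6", "r9", ("r5", "r6"))))    # {'RAW'}
print(sb.hazards(Instr("ld r5, r2, r0", "r5", ("r2", "r0"))))        # {'WAW'}
# Static renaming: the programmer picks another destination register.
print(sb.hazards(Instr("ld r12, r2, r0", "r12", ("r2", "r0"))))      # set()
```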

Concerning  control  hazards,  the  emulated  processor  includes  delayed  branch 
instructions  (one  slot)  as  well  as  static  branch  prediction  in  the  decode  stage. 
This  allows  the  student  to  use  the  branch  instructions  available  in  the  instruction 
set  to  make  their  own  predictions,  comparing  performance.  The  instructions 
fetched  due  to  a  branch  prediction  are  tagged  (conditionally  executed).  If  the 
prediction  was  correct,  the  instructions  that  have  been  predicted  are  untagged 
and they are converted to normal instructions. If a misprediction has been
detected, tagged instructions are aborted.

The  emulator  also  implements  the  MC88110  history  buffer,  a  FIFO  queue 
storing  the  issued  instructions  in  the  program  order  and  the  previous  value  of 
the  destination  register,  in  order  to  restore  the  state  previous  to  their  execution 
when there is a misprediction.

When  the  first  instruction  of  the  history  buffer  completes  its  execution,  the 
sequencer  removes  every  instruction  completed.  If  the  instruction  becoming  the 
head  of  the  history  buffer  is  a  branch  whose  prediction  failed,  all  the  tagged 
instructions  are  removed  and  the  values  of  their  destination  registers  are  restored 
to  those  saved  in  the  history  buffer. 
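The role of the history buffer can be sketched as below. This is a simplified model, not the emulator's implementation: entries are dictionaries, registers live in a plain dictionary, and all names are hypothetical. Tagged entries are undone from the youngest to the oldest so that a register written twice ends up with its oldest saved value.

```python
from collections import deque


class HistoryBuffer:
    """FIFO of issued instructions with the old value of their destination."""

    def __init__(self, regs):
        self.regs = regs                 # register file, e.g. {"r7": 1}
        self.entries = deque()           # program order, oldest at the left

    def issue(self, text, dest, tagged=False):
        """Record an issued instruction and the current value of dest."""
        entry = {"text": text, "dest": dest, "old": self.regs[dest],
                 "done": False, "tagged": tagged}
        self.entries.append(entry)
        return entry

    def write_back(self, entry, value):
        self.regs[entry["dest"]] = value
        entry["done"] = True

    def retire(self):
        """Remove completed instructions from the head of the buffer."""
        while self.entries and self.entries[0]["done"]:
            self.entries.popleft()

    def squash_tagged(self):
        """Misprediction: drop tagged entries and restore their registers."""
        while self.entries and self.entries[-1]["tagged"]:
            entry = self.entries.pop()                 # youngest first
            self.regs[entry["dest"]] = entry["old"]


regs = {"r7": 1, "r8": 2}
hb = HistoryBuffer(regs)
first = hb.issue("add r7, r7, r10", "r7", tagged=True)
hb.write_back(first, 10)
second = hb.issue("add r7, r7, r10", "r7", tagged=True)
hb.write_back(second, 20)
hb.squash_tagged()
print(regs["r7"])   # 1
```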


3  Program  execution  debugging  and  visualization 

We  have  developed  an  Assembler  which  generates  the  binary  files  used  by  the 
emulator.  This  Assembler  allows  using  a  wide  instruction  subset  of  the  actual 
MC88110,  as  well  as  some  pseudoinstructions  specified  in  IEEE-694  standard 
(org,  res  and  data). 

Figure 1 shows an assembly program fragment that performs the dot product
of two vectors (V1 and V2). For instance, the instruction bb1.n 3, r3, loop
branches if the third bit of r3 (r4 not equal r0) is set. This instruction predicts
that the branch will be taken. The suffix .n means the following instruction
will be executed before taking the branch (delayed branch).




        and     r8, r0, r0      ; r8 contains the dot product
loop:   ld      r5, r1, r0      ; r5 and r6 are loaded with an element
        ld      r6, r2, r0      ; of both vectors
        sub     r4, r4, 1       ; The counter is decremented
        add     r1, r1, 4       ; V1's pointer is incremented
        mulu.d  r9, r5, r6      ; Multiply result is on r9 and r10
        add     r2, r2, 4       ; V2's pointer is incremented
        cmp     r3, r4, r0
        add.co  r7, r7, r10     ; The result of mulu is accumulated
        bb1.n   3, r3, loop     ; if r4 <> 0 then branch to loop
        add.ci  r8, r8, r9
        st.d    r7, r11, r0
error:  stop                    ; End of emulation


Fig.  1.  Assembly  program  to  perform  the  dot  product  of  two  vectors 


The embedded debugger allows the user to control program execution. Every
time the program shows the prompt to the user, the emulator displays the
processor internal state: register contents, status register and pipeline
state. Figure 2 shows the information provided by the emulator: current
instruction, program counter (PC), register file (only selected registers),
processor status register, some cache statistics and the pipeline state. The
contents of the history buffer at that instant can also be visualized.


[Emulator display, only partly legible in the source.
PC=64   Current instruction: add r01,r01,4
FL=1  FE=1  FC=0  FV=0  FR=0
Tot. Inst: 13   Cycle: 31
Register file: R05, R06, R07, R09 and R10 = 00000000 h; R11 = 0000006C h;
the remaining values are illegible.
Instruction cache: 9 accesses, 3 misses. Hit ratio 66.6
Data cache: 2 accesses, 1 miss. Hit ratio 50.0
Pipeline stages FETCH, DEC, EXEC and WBCK: instructions at addresses 52 (ld),
56 (ld), 60 (sub), 64 (add) and 68 (mulu.d).

History buffer contents:
52  ld   r05,r01,r00  Not executed  R05: 00000000
56  ld   r06,r02,r00  Not executed  R06: 00000000
60  sub  r04,r04,1    Not executed  R04: 0000000A]


Fig.  2.  Emulator  state  after  executing  30  machine  cycles 

FETCH:  C  56  ld      r06,r02,r00
        C  52  ld      r05,r01,r00
DEC:    C  92  st      r07,r11,r00
        C  88  add.ci  r08,r08,r09
EXEC:      84  bb1.n   03,r03,-8
           80  add.co  r07,r07,r10
WBCK:

History buffer contents:
80  add.co  r07,r07,r10  Not executed  R07: 00000000
84  bb1.n   03,r03,-8    Not executed

Fig.  3.  Emulator  state  after  executing  52  machine  cycles 


When the first loop iteration finishes (see Figure 3), the instruction bb1 has
been issued to the branch unit. A taken branch had previously been predicted,
so the instructions stored at addresses 52 and 56 are tagged as conditional.
As the prediction was right, the branch unit will remove those tags at the end
of this cycle. Furthermore, the tag of instruction 88 will be removed because
it occupies the delay slot of the delayed branch. On the other hand,
instruction 92 will be aborted. The final pipeline state is shown in Figure 4.


FETCH:  64  add     r01,r01,4
        60  sub     r04,r04,1
DEC:    56  ld      r06,r02,r00
        52  ld      r05,r01,r00
EXEC:   88  add.ci  r08,r08,r09
WBCK:   80  add.co  r07,r07,r10
        84  bb1.n   03,r03,-8

History buffer contents:
80  add.co  r07,r07,r10  Not executed  R07: 00000000
84  bb1.n   03,r03,-8    Not executed
88  add.ci  r08,r08,r09  Not executed  R08: 00000000

Fig.  4.  Emulator  state  after  executing  53  machine  cycles 


4  Conclusions 

This  paper  presents  a  superscalar  processor  emulator  for  educational  purposes. 
Most  of  the  processor  parameters  are  fully  configurable,  so  it  may  be  used  to 
teach  cache  behavior  as  well  as  pipeline  and  superscalar  computer  concepts. 

Fig. 5. em88110 emulator X-window interface


The emulator currently has a textual interface, but we are implementing an
X-window based one (Figure 5 shows the information it will provide).

We are currently improving the emulator to allow selecting the number of
instructions issued per cycle. The student will be able to choose whether one
or two instructions will be issued, in order to emulate a conventional
pipelined machine or a superscalar one.


Abstract. In this paper we present the results we have obtained after applying
a parallel genetic algorithm (PGA) to the Multi-FPGA partitioning problem.
Solutions are based on Xilinx 3000 series FPGA's and satisfy some constraints
that allow routing within the set of FPGA's that constitutes the Multi-FPGA
system. To verify our studies we have used circuits from Partitioning
Benchmark93 at the NCSU CAD Benchmarking Laboratory. The experimental
results have been obtained using the CRAY T3E.


1. Introduction 


Nowadays, FPGA systems are widely used because of their prototyping and
correction capabilities. Every day, new FPGA's appear on the market with
higher integration density, along with tools with wider capabilities. However,
these increasing capabilities do not meet all the needs of some designs, so it
is necessary to distribute these designs among several FPGA's. This is the
major reason for Multi-FPGA systems [1]. The first step in the design flow is
to partition the system. In other words, we have to decide how many FPGA's are
needed to implement the system, their type and their distribution. We present
a PGA to solve the partitioning problem of Multi-FPGA systems. These
algorithms have been successfully used in other optimization problems [2]. If
the partition process precedes the technology mapping, it is called functional
partitioning; otherwise it is called structural partitioning [3].

In the case of structural partitioning of Multi-FPGA systems, this method
allows us to obtain solutions with a great number of blocks. We can also use
industrial tools, such as XACT [4], to accomplish the first stages of the
design flow. Our partitioning algorithm then divides the results that XACT
produces from the initial system specification.

In  the  area  of  Multi-FPGA  system  partitioning  there  are  a  few  tools  which  involve 
constraints,  e.g.,  Kuznar's  research  [5][6].  Its  major  drawback  is  that  this  method  has 
been  designed  for  heterogeneous  systems  but  the  implementation  is  undertaken  on 
homogeneous  systems. 

This paper is organised as follows. In Section 2 we describe a parallel genetic
algorithm. In Section 3 we show its application to the Multi-FPGA system
partitioning problem, and the experimental results are presented in Section 4.
The paper ends with some conclusions and future research.


2.  Parallel  Genetic  Algorithms 

Genetic  algorithms  [7]  are  optimization  techniques  which  imitate  the  way  that 
nature  selects  the  best  individuals  (the  best  adaptation  to  the  environment)  to  create 
descendants  which  are  more  highly  adapted.  The  first  step  is  to  generate  a  random 
initial  population,  where  each  individual  is  represented  by  a  character  chain  like  a 
chromosome  and  with  the  greatest  diversity,  so  that  this  population  has  the  widest 
range  of  characteristics.  Then,  each  individual  is  evaluated  using  a  fitness  function 
which  indicates  the  quality  of  each  individual.  Finally,  the  best-adapted  individuals 
are  selected  to  generate  a  new  population,  whose  average  will  be  nearer  to  the  desired 
solution. This new population is created making use of three operators:
reproduction, crossover and mutation.

One of the major aspects of GAs is their ability to be parallelised. Indeed,
because natural evolution deals with an entire population and not only with
particular individuals, it is a remarkably highly parallel process [8].

It has been established that the efficiency of a GA in finding an optimal
solution is largely determined by the population size. With a larger
population size, the genetic diversity increases, and so the algorithm is more
likely to find a global optimum. However, a large population requires more
memory to be stored, and it has also been proved that it takes a longer time
to converge. The use of today's new parallel computers not only provides more
storage space but also allows the use of several processors to produce and
evaluate more solutions in a shorter time.

We use a coarse grained parallel GA. The population is divided into a few
subpopulations or demes, and each of these relatively large demes evolves
separately on a different processor. Exchange between subpopulations is
possible via a migration operator. In the literature, this model is sometimes
also referred to as the island model. Sometimes we can also find the term
'distributed' GA, since these algorithms are usually implemented on
distributed memory machines.

Technically there are three important features in the coarse grained PGA: the
topology, which defines the connections between subpopulations; the migration
rate, which controls how many individuals migrate; and the migration
interval, which determines how often migration occurs.

Many topologies can be defined to connect the demes. We present results using
a simple stepping stone model and a master-slave model. In the former, the
demes are distributed in a ring and migration is restricted to neighboring
demes. In the latter, there is a master population connected to all the
slaves.

Choosing the right time for migration and which individuals should migrate
appears to be more complicated, and a lot of work is being done on this
subject. Several authors propose that migrations should occur after a time
long enough to allow the development of good characteristics in each
subpopulation [9]. However, it also appears that immigration is a trigger for
evolutionary changes. In our algorithm the migration occurs after each new
generation; therefore the algorithm is more or less equivalent to a
sequential GA with a larger population.

In  our  problem,  migrants  are  selected  from  the  best  individuals  in  the  population 
and  they  replace  the  worst  in  the  receiving  deme.  The  number  of  migrants  may  be 
selected  at  execution  time.  With  this  operator,  our  PGA  has  better  convergence 
properties  than  the  sequential  version. 
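
The coarse grained scheme described in this section can be sketched as follows. This is a minimal, sequentially simulated illustration rather than the implementation used on the Cray T3E: the function names, the binary chromosome, the toy cost function (count of ones, to be minimised), the tournament selection and the crossover probability of 0.8 are all assumptions, and migration is simplified to a one-directional ring. Only the mutation probability (0.015), the migration after every generation and the best-replace-worst policy come from the text.

```python
import random


def evolve(deme, fitness, rng, p_cross=0.8, p_mut=0.015):
    """One generation of a simple GA on one deme (lower fitness is better)."""
    def pick():
        # Binary tournament selection (an assumed selection scheme).
        a, b = rng.choice(deme), rng.choice(deme)
        return a if fitness(a) <= fitness(b) else b

    children = []
    while len(children) < len(deme):
        p1, p2 = pick(), pick()
        if rng.random() < p_cross:
            cut = rng.randrange(1, len(p1))      # one-point crossover
            child = p1[:cut] + p2[cut:]
        else:
            child = p1[:]
        # Bit-flip mutation applied gene by gene.
        child = [1 - g if rng.random() < p_mut else g for g in child]
        children.append(child)
    return children


def migrate_ring(demes, fitness, n_migrants):
    """Best individuals of deme i-1 replace the worst of deme i (ring)."""
    if n_migrants <= 0:
        return
    # Select all migrants first, so every deme exports pre-migration bests.
    best = [sorted(d, key=fitness)[:n_migrants] for d in demes]
    for i, deme in enumerate(demes):
        incoming = best[(i - 1) % len(demes)]
        deme.sort(key=fitness)                   # worst individuals last
        deme[-n_migrants:] = [ind[:] for ind in incoming]


def island_ga(n_demes=4, deme_size=20, n_genes=32, generations=50,
              n_migrants=2, seed=0):
    rng = random.Random(seed)
    fitness = sum                                # toy cost: number of ones
    demes = [[[rng.randint(0, 1) for _ in range(n_genes)]
              for _ in range(deme_size)] for _ in range(n_demes)]
    for _ in range(generations):
        demes = [evolve(d, fitness, rng) for d in demes]
        migrate_ring(demes, fitness, n_migrants)  # migrate every generation
    return min(fitness(ind) for d in demes for ind in d)
```

On a real distributed memory machine each deme would run on its own processor and the migrants would travel as messages; here the demes are simply evolved one after another.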


3.  Genetic  partitioning  for  Multi-FPGA  systems 


Figure 1 describes the design and implementation flow of a Multi-FPGA system.
It starts from an initial specification (a netlist or an HDL description),
which is used as XACT input. XACT returns the number of CLB's and IOB's. Then,
it is necessary to determine the optimum distribution of the CLB's over the
different available FPGA's. An optimum distribution has a minimal cost and
guarantees the internal routability of each FPGA. For this purpose we use the
PGA described in Section 2.


Fig.  1.  Design  and  implementation  flow  of  a  Multi-FPGA  system 

The input to our algorithm must include the number of CLB's necessary to
implement the circuit. In order to evaluate the different solutions, it is
also necessary to have an FPGA library, which must include the number of CLB's
and the cost of each FPGA. In our case we have used the data corresponding to
the three simplest devices of the Xilinx 3000 series: XC3020, XC3030 and
XC3042. After the optimisation, the algorithm returns the number of devices of
each type, the distribution of the CLB's and the percentage of utilisation of
the FPGA's.

Our problem has been coded as follows: each individual represents a
distribution of CLB's over the set of FPGA's. We have supposed that we have
three different types of Xilinx 3000 series FPGA's and that we can use as many
as necessary [10]. Each individual is a chromosome with as many genes as the
number of CLB's in the original circuit. Each CLB is represented by a gene,
which has a different value depending on which kind of FPGA it uses.
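
One possible reading of this encoding can be sketched as follows. The representation is an assumption made for illustration: gene i holds the index of the device that receives CLB i, and a separate list gives the type of each device. The function name utilisation is hypothetical, and the CLB capacities (64, 100 and 144) are taken from the Xilinx XC3000 family data sheet rather than from this paper; device costs are not given here and are therefore left out.

```python
from collections import Counter

# CLB capacity per device type (XC3000 family data sheet values).
CAPACITY = {"XC3020": 64, "XC3030": 100, "XC3042": 144}


def utilisation(chromosome, device_types):
    """Return {device index: (busy CLBs, holes, pc)} for one individual.

    chromosome[i] is the index of the device that holds CLB i and
    device_types[d] names the type of device d (assumed encoding).
    """
    busy = Counter(chromosome)
    report = {}
    for dev, count in sorted(busy.items()):
        cap = CAPACITY[device_types[dev]]
        if count > cap:
            raise ValueError(
                f"device {dev} holds {count} CLBs, capacity is {cap}")
        # holes = free CLBs, pc = fraction of busy CLBs in the device
        report[dev] = (count, cap - count, count / cap)
    return report
```

For example, a 130-CLB circuit split as 50 CLBs on an XC3020 and 80 on an XC3030 gives occupations of 50/64 and 80/100, which a fitness function could then compare against a routability threshold.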

[...] solve the partition and placement problem simultaneously with the
routing [...] routable whenever the percentage (pc) of busy CLB's does not
exceed [...]. This is one of the constraints that our fitness function
satisfies, as Figure 2 shows. Moreover, it minimizes the final cost (cost) of
the circuit, according to the 3000 series specifications, and the number of
holes (free CLB's). The term Penalty that appears in the fitness function acts
when the system is not routable.


$$\mathrm{Fitness} = \sum_{i} \left[ hole_i \cdot K_1 + Penalty_i + Cost_i \right]$$

$$Penalty_i = K_2 \cdot e^{K_3 \cdot pc_i}$$

Fig.  2.  Cost  and  penalty  functions  used  in  the  genetic  algorithm. 


The values of the GA parameters are the following: the crossover probability
(Pc) is [...], the mutation probability (Pm) is equal to 0.015, and the
population size is set to [...] individuals. The constants K1, K2 and K3 have
been adjusted experimentally to satisfy the constraints.


4.  Experimental  Results 


The circuits that we have used for testing our algorithm have been obtained
from the Partitioning Benchmarks 93 suite. The characteristics of these
circuits after using the XACT tool are shown in Table 1.

Table 2 compares our results with those obtained by Kuznar. The comparison is
made in terms of cost and occupation of CLB's.

Table 3 compares the sequential algorithm with the parallel versions (using 8
processors of the Cray T3E). The results show that the second approach has
better convergence properties due to its non-overlapping replacement
characteristic.

Circuit    Num CLB's    Num IOB's
C3540      283          72
C5315      377          301
C7552      833          64
C6288      489          313
S5378      381          86
S9234      454          43
S13207     915          154
S15850     842          101

Table 1. Characteristics of the test circuits


Circuit    GA cost    GA Pc    Kuznar cost    Kuznar Pc
C3540      5.20       0.76     4.99           0.77
C5315      6.56       0.79     7.76           0.52
C7552      14.6       0.79     13.66          0.83
C6288      9.40       0.70     7.88           0.85
S5378      7.56       0.71     6.19           0.94
S9234      9.40       0.66     7.98           0.85
S13207     15.96      0.79     16.81          0.81
S15850     14.12      0.83     14.97          0.80

Table 2. Comparison between the Kuznar and the GA algorithms


Circuit    Sequential       Ring PGA    Master-Slave PGA
C3540      19.78 (500)      3.883       3.973
C5315      - (>2000)        10.298      10.548
C7552      85.31 (725)      11.465      [...]
C6288      61.78 (900)      6.682       8.525
S5378      38.75 (725)      6.504       15.975
S9234      57.34 (900)      18.574      18.995
S13207     116.627 (900)    15.734      28.917
S15850     - (>2000)        25.980      26.546

Table 3. Comparison between the sequential and the parallel GA's (time in seconds)

5. Conclusions


The main conclusions of our research can be summarised as follows: (1) The PGA
improves on Kuznar's results. Although the cost is not always improved, the
routability of the system is almost assured in all cases. We always obtain
either a cost improvement or a routing improvement. (2) The logic block
distribution given by the PGA assures the internal routability of the system
in 88% of the cases. The cost in dollars of the resulting circuit has also
been reduced in 45% in the experiments compared to the Kuznar results. (3) The
sequential version of the GA needs more than 2000 generations to obtain an
acceptable solution, whereas, in the worst case, the 8-processor Ring PGA
needs only 225 generations and the Master-Slave PGA 300 generations. This
result is due to the simultaneous search and the implicit non-overlapping
replacement of the PGA. (4) Finally, it is interesting to note that the Ring
PGA gives better results than the Master-Slave PGA, due to premature
convergence effects.


Abstract. In his 1978 Turing Lecture [1], John Backus drew the attention of
the computer science community to functional languages. One of the claims he
made was that pure functional languages offer a greater potential for
parallelism than other programming paradigms, because their property of
referential transparency means less interdependence between parts of a
program. Meeting this promise has been a challenge. This paper presents
Haskell#, a parallel functional language with explicit parallelism based on
MPI (Message Passing Interface) for communication between functional blocks
of code.

1  Introduction 

Functional languages provide a nicer syntax for the λ-calculus, a function
theory widely used to provide semantics to programming languages of all
paradigms [2]. The Church-Rosser theorems state that normal forms of
λ-expressions are unique modulo variable renaming, and that reducing the
leftmost-outermost reducible expression at each point of the reduction
sequence leads to the normal form, if it exists [2]. Thus, the reduction can
even be done in parallel. This suitability of functional languages for
parallel processing has led various researchers to propose different parallel
implementations of functional languages.

The history of parallel functional programming comprises two phases. The
first period, the 1980s and before, corresponds to the time in which
parallelism was sought as a way to make functional languages run as fast as
imperative ones. The second period is the time in which "real" parallel
processing can find an alternative in functional programming.

The first attempts to exploit the parallelism of functional programs either
targeted the evaluation of actual parameters before replacing them by the
formal parameters or were made at combinator argument level. These strategies
lead to a very fine grained parallelism. As a result, it was believed that
novel architectures were necessary to achieve high performance with
functional languages, and this led to a spate of designs for special-purpose
machines, such as ALICE (Transputer-based), ICL Flagship, EDS/Goldrush and
GRIP.

Unfortunately, these experiments proved that building [...] parallel
implementation of functional programming [...]. These implementations have
[...] high-level parallel environment [...] system workload. Users [...]
indicatives to potential parallel tasks. [...] program/data graph, so [...]
allocation, through Message Passing Interface (MPI) [5]. MPI [...]
programming model and will be used [...] the creation, distribution and
communication between tasks. Furthermore, [...] task will possess its own
local sequential run-time system.

[...] problem. Some systems rely on the programmer's ability for [...]
evaluating in parallel. These systems [...] constructs/annotations into the
language. Others make use of [...] of some expressions [...] annotate the
program. Others yet use execution-profiling information [...]
1.1  Implicit  Parallelism 

[...] however, neededness is undecidable in general and strictness [...]
realistic parallel programs [...]

Despite the efforts of several research groups around the world trying to
exploit implicit parallelism in functional languages, results are still very
modest. In our opinion this is a direct consequence of the communication
overhead brought by the very fine granularity tasks generated by this
strategy.

1.2  Explicit  Parallelism 

In explicitly parallel languages, such as Occam [4], it is up to the user to
define the parallel tasks. The results of implicit parallel implementations
of functional languages, as well as the belief that the bottom line of any
parallel system is raw performance, and that a program's performance can only
be improved if it can be understood [10], led a number of researchers to
exploit explicit parallelism.

Improving a sequential program by partitioning it into parallel tasks is not
a simple task and requires a complete knowledge of the program as well as of
the architecture on which it will execute. Annotations for parallelism are
common. Hope+ on Flagship employs strictness annotations to control the
precise degree of evaluation.

2  Haskell 

Haskell is a general purpose, pure functional programming language which
incorporates higher-order functions, non-strict semantics, static polymorphic
typing, user-defined algebraic datatypes, pattern-matching, list
comprehensions, a module system, monads, and a rich set of primitive
datatypes, including arrays, arbitrary and fixed precision integers, and
floating-point numbers [3, 9]. Haskell has now become the de facto standard
for non-strict functional languages.

Among the implementations of Haskell compilers, Concurrent Haskell [8] and
GUM [10] seem to be very promising.

Concurrent Haskell is a concurrent extension to the lazy functional language
Haskell, which provides a more expressive substrate on which to build
sophisticated I/O-performing programs, notably ones that support graphical
user interfaces, for which the usefulness of concurrency is well established.
The goal of the designers of Concurrent Haskell is to attain implicit,
semantically transparent parallelism, but the version available now uses
explicit parallelism.

GUM is a portable, message-based parallel implementation of Haskell.
Portability is facilitated by using the PVM communications harness, which is
available on many multi-processors. GUM is available for both shared-memory
and distributed-memory (networked workstations) architectures. Initial
performance figures demonstrate speedups relative to sequential compiler
technology.

3  MPI 

Message Passing is a paradigm widely used on loosely coupled parallel machines.
Although there are many variations, the basic concept of processes
communicating through messages is well understood. Over the last ten years,
substantial progress has been made in casting significant applications in this
paradigm.

The main advantages of using a message-passing standard are efficiency,
portability and ease of use. In a distributed memory communication
environment in which the higher level routines and/or abstractions are built
upon lower level message passing routines, the benefits of standardization
are particularly apparent. Furthermore, the definition of a message passing
standard, such as that proposed in [5], provides vendors with a clearly
defined base set of routines that they can implement efficiently, or in some
cases provide hardware support for, thereby enhancing scalability. MPI also:


•  provides an application programming interface;

•  allows efficient communication: it avoids memory-to-memory copying and
allows overlap of computation and communication and offload to a
communication co-processor, where available;

•  allows for implementations that can be used in heterogeneous
environments;

•  allows convenient C and Fortran 77 bindings for the interface;

•  assumes a reliable communication interface: the user need not cope with
communication failures. Such failures are dealt with by the underlying
communication subsystem;

•  is not too different from current practice, such as [...], and provides
extensions that allow greater flexibility;

•  defines interfaces implemented on many vendors' platforms.

The parallel programming model supported by our implementation is message
passing: a set of tasks, each executing in its own address space,
communicating via calls to the Message-Passing Library. Such a parallel
programming model offers a multitude of alternatives: some functions
supported by microcode on the adapter and some by software on the computing
processor; some functions executed in user space and some by the kernel;
trade-offs between more extensive use of buffering and data copying and more
eager use of interrupts; "push" versus "pull" protocols; flow control; etc.


4  Haskell#


Haskell# is a new language composed of parallel constructors (a subset of MPI
[...]) and functional programs (Haskell programs). An IBM SP2 system with 9
(nine) processor nodes was chosen as testbed. Haskell# has some important
differences from other implementations:


•  an explicit static task allocation is adopted;

•  MPI is used to manage a coarse task program distribution;

•  each task is actually a functional program, with a local run-time system
completely independent of the manager task module.

4.1  Parallel  Module 

We now describe the main ideas behind the use of MPI to implement Haskell#.

Program Structure. Communication functions of MPI are used to express
parallelism following the same mechanism present in the Occam [4] programming
language. According to the parallel constructors inserted by the user in a
given Haskell# program, MPI spawns the required number of processes on the
available processors. Thus, Haskell# enables an application to be described
as a collection of processes, where each process executes concurrently and
communicates with other processes through channels. Each process in such an
application describes the behaviour of a particular aspect of the
implementation, and each channel describes the connection between two of the
processes.
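
The process-and-channel structure described above can be illustrated with a small Python sketch. It is only an analogy: threads stand in for the processes and queue.Queue objects stand in for the channels, whereas Haskell# itself uses MPI tasks, each running a Haskell program. The function names (producer, squarer, run_pipeline) and the use of None as an end-of-stream marker are assumptions made for this example.

```python
import queue
import threading


def producer(out_chan, values):
    # Send each value down the channel, then a sentinel to close it.
    for v in values:
        out_chan.put(v)
    out_chan.put(None)


def squarer(in_chan, out_chan):
    # A purely functional step (squaring) wrapped in a communicating process.
    while (v := in_chan.get()) is not None:
        out_chan.put(v * v)
    out_chan.put(None)


def run_pipeline(values):
    # Two channels connect three processes: producer -> squarer -> collector.
    a, b = queue.Queue(), queue.Queue()
    workers = [threading.Thread(target=producer, args=(a, values)),
               threading.Thread(target=squarer, args=(a, b))]
    for w in workers:
        w.start()
    results = []
    while (v := b.get()) is not None:
        results.append(v)
    for w in workers:
        w.join()
    return results
```

Each process touches only its own local data and the channels it was given, which mirrors the idea that the functional code in every task stays independent of the communication layer that glues the tasks together.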

Communication Library. MPI supports two classes of message passing functions:
point-to-point calls, which send a message from one task to another task, and
collective communication calls, which establish a communication pattern
within a group of tasks.

MPI point-to-point communication includes blocking and non-blocking send
and receive functions. The use of non-blocking sends and non-blocking
receives is both safe (in terms of deadlock avoidance) and efficient. Some
extra programming effort is required, since the programmer must determine the
status of the communication before reusing the buffer (the memory location in
the user's program that holds the message data before transmission or after
receipt).

Blocking routines protect naive programmers from accidentally altering
message buffer contents. The trade-off can be increased communication cost.
Deadlock can occur in cases where a large message volume is being sent. The
situations most appropriate for blocking routines are those in which there is
little work that can be done between initiation of the communication and use
(or reuse) of the buffer.

In this first approach, Haskell# uses MPI point-to-point call functions.
Furthermore, in order to provide safety and higher performance, we adopt the
MPI non-blocking communication library.

4.2  Run-Time  System 

The Recife Haskell Compiler (RHC) run-time system was adopted as the
evaluation environment for the value expressions executing on an SP2
processor node. Each process represents an individual sequential Haskell
program, evaluated by ΓCMC, an abstract machine for the efficient
implementation of lazy functional languages. ΓCMC transfers the control of
the execution flow to C as much as possible, to take advantage of the
extremely low cost of procedure calls in modern RISC architectures. This
yielded a substantial improvement in performance.

Almost all implementations of parallel graph reduction proceed on a shared
program/data graph [10]; thus a primary function of the run-time system of
these parallel functional languages is to manage the virtual shared memory in
which the graph resides. However, in contrast with previous implementations,
Haskell# does not perform parallel graph reduction on a shared program/data
graph. Here, each individual task (process) has its own local stacks and
heap. As a result, Haskell# performs garbage collection locally.

4.3  Conclusions 

In this paper, we presented the fundamental ideas behind Haskell# and drew
comparisons with its supposed competitors. Haskell# is a simple explicit
parallel functional language where the MPI-based communication combinators
glue together large chunks of pure Haskell code, allowing a hierarchical
programming discipline that rescues the ability to reason about parallel
functional programs, a feature lost by our competitors by including the
parallel combinators in the languages themselves.

Reference [7] presents further details of the Haskell# language, such as its
semantic model of parallelism, as well as performance figures for benchmarks
running on a 9-node IBM SP2 platform.


Abstract. State estimation is nowadays considered the fundamental element of
modern electrical power network control centres. In this paper we develop a
theoretically robust and computationally efficient state estimator algorithm
to solve the WLS problem by using parallel processing. The computational
aspects of the parallel processing were analysed and tested using the IEEE 30,
57 and 118 bus systems. Computational experiments are compared with standard
WLS methods, in the integral and distributed versions. An evaluation of the
degree of natural decoupling in the state estimation problem is also
performed. The results indicate that distributed processing for state
estimation is the best way to adopt parallel computing in power systems.


1.  Introduction 

The implementation of robust methods for power system state estimation that maintain
performance suitable for the large models encountered in modern control centres is a topic that
has received significant attention. The estimator processes real-time redundant telemetered and
pseudo measurements to provide a complete, coherent and reliable system database, which can
describe the electrical state of the network [1]-[2]. These measurements, which include voltage
magnitudes, real and reactive line flows and nodal power injections, are taken from the
network at a certain moment and yield an estimate of the corresponding state vector (the vector of
voltage magnitudes and phases at the different buses) [3]. The higher frequency of state estimation
execution requires the development of faster state estimation algorithms, and the larger size of the
supervised networks increases the demand on the numerical stability of the algorithms. At the
same time, conventional centralised state estimation methods have reached a development stage
in which important improvements in either speed or numerical robustness are not likely to
occur. These facts, together with the technical developments in fast data communication
network technology, open up the possibility of parallel and distributed implementations of the
state estimation algorithms [4]-[5]. The geographically distributed nature of power system
applications can benefit from this form of decentralised computer architecture, in which
several remote processors perform local state estimation in network areas and the results are
sent to a central computer that refines the calculation. The power system under consideration
may be partitioned into K areas, each supervised by a local control center. The
measurement data in each area are collected in the local control center, which has
at least one computer system for data acquisition, data processing, and computation [9]. The
computer systems of adjacent areas are connected by fast data communication links, and these
decentralised computer systems form a computer network.

2. WLS State Estimation Problem

Mathematically, the information model used in power system state estimation is represented by
the equation:

$$ z = h(x) + e \qquad (1) $$

where $z$ is an $(m \times 1)$ measurement vector, $x$ is an $(n \times 1)$ true state vector, $h(\cdot)$ is an $(m \times 1)$ vector
of non-linear functions, $e$ is an $(m \times 1)$ measurement error vector, $m$ is the number of
measurements, and $n$ is the number of state variables. The static state estimation problem of an $N$-bus
power system is a weighted least squares (WLS) optimisation problem:

$$ \min_x J(x) = \sum_{i=1}^{m} w_i \, (z_i - h_i(x))^2 = [z - h(x)]^T W [z - h(x)] \qquad (2) $$

The weight $w_i$ is associated with measurement $z_i$. Weights are chosen
proportional to the accuracy of the measurements: the higher the accuracy of a measurement,
the bigger its weight. The solution of this optimisation problem gives the estimated state $\hat{x}$,
which must satisfy the following optimality condition:

$$ \frac{\partial J(\hat{x})}{\partial x} = 0 \;\Rightarrow\; H^T(\hat{x}) \, W \, [z - h(\hat{x})] = 0 \qquad (3) $$

where

$$ H(x) = \frac{\partial h(x)}{\partial x} $$

is the Jacobian matrix of the measurement function $h(x)$. The solution of the non-linear
equation (3) may be obtained by an iterative method in which a linear equation of the following
type is solved at each iteration to compute the correction:

$$ x^{i+1} = x^{i} + \Delta x^{i} $$

$$ [G(x^{i})] \, \Delta x^{i} = H^T(x^{i}) \, W \, [z - h(x^{i})] \qquad (4) $$

where $G(x)$ is called the gain matrix and is usually chosen as

$$ G(x) = H^T(x) \, W \, H(x) $$

Eq. (4) is called the normal equation of the WLS problem. As in load flow calculations, it has
been found that state estimation algorithms based on decoupled versions behave adequately for
the usual power networks [2]. Therefore, the decoupled model that has been mostly adopted is:

$$ z_p = h_p(\theta, v) + e_p \qquad (5) $$

$$ z_q = h_q(\theta, v) + e_q \qquad (6) $$

where $\theta$ $(n_\theta \times 1)$ and $v$ $(n_v \times 1)$ are the vectors of true voltage phase angles and magnitudes, and $p$
and $q$ indicate partitions of vectors and matrices corresponding to active and reactive
measurements, respectively;

$$ n_\theta = N - 1; \quad n_v = N, $$

where $N$ is the number of network nodes. This naturally decoupled characteristic makes this
method suitable for a parallel processing implementation, with a great reduction of the required
computation time.
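
The iteration built on the normal equation (4) can be sketched in a few lines. The following is a minimal illustration only, not the authors' implementation: the function names (`solve`, `wls_estimate`), the three-measurement toy model and the weights are assumptions made for this example, and the diagonal of W is passed as a plain list `w`.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def wls_estimate(z, w, h, jac, x0, tol=1e-10, max_iter=50):
    """Iterate x <- x + dx, where G dx = H^T W (z - h(x)) and G = H^T W H."""
    x = list(x0)
    m, n = len(z), len(x0)
    for _ in range(max_iter):
        H = jac(x)                                   # m x n Jacobian at x
        r = [zi - hi for zi, hi in zip(z, h(x))]     # measurement residual
        G = [[sum(H[i][a] * w[i] * H[i][b] for i in range(m))
              for b in range(n)] for a in range(n)]  # gain matrix
        rhs = [sum(H[i][a] * w[i] * r[i] for i in range(m)) for a in range(n)]
        dx = solve(G, rhs)
        x = [xi + di for xi, di in zip(x, dx)]
        if max(abs(d) for d in dx) < tol:
            break
    return x


# Hypothetical toy model (not a power network): three measurements, two states.
h = lambda x: [x[0], x[1], x[0] * x[1]]
jac = lambda x: [[1.0, 0.0], [0.0, 1.0], [x[1], x[0]]]
z = h([1.0, 0.5])              # noise-free measurements of the state (1.0, 0.5)
x_hat = wls_estimate(z, [100.0, 100.0, 25.0], h, jac, [0.8, 0.8])
```

With noise-free measurements the estimate returns to the state that generated them; with noisy data the same loop yields the weighted least squares compromise.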

3. Parallel and Distributed State Estimation Problem

If we decompose the power network into K areas, connected through boundary buses which
belong simultaneously to both adjacent areas, the state estimation problem introduced in (5)
and (6) can be presented as

$$ z_p^k = h_p^k(\theta^k, v^k) + e_p^k, \quad k = 1, \ldots, K \qquad (7) $$

$$ z_q^k = h_q^k(\theta^k, v^k) + e_q^k, \quad k = 1, \ldots, K \qquad (8) $$

where $z_p^k$ and $z_q^k$ are vectors of active and reactive measurements in area $k$; $\theta^k$ and
$v^k$ are vectors of voltage phase angles and magnitudes in area $k$, including the ones
corresponding to the boundary buses. The number of boundary buses may be kept to a
minimum and there are no injection measurements in the overlapping area buses. This is not a
limitation, because actual injection measurement buses in overlapping areas can be replaced by
fictitious buses with no injection measurements, connected to the actual buses (now placed
outside the overlapping area) by zero impedance lines [10]. Then, the problem of distributed
state estimation is to use the computer network associated with the measurement data collected
in each local control center to solve the following weighted least squares (WLS) problem in a
distributed way:

$$ \min \sum_{k=1}^{K} [z_p^k - h_p^k(\cdot)]^T W_p^k [z_p^k - h_p^k(\cdot)], \qquad \min \sum_{k=1}^{K} [z_q^k - h_q^k(\cdot)]^T W_q^k [z_q^k - h_q^k(\cdot)] \qquad (9) $$

where $W_p^k$ and $W_q^k$ are the weight matrices of the active and reactive measurements in area $k$.
The iterative solution of the above problem for area $k$ is:

$$ \theta^k(i+1) = \theta^k(i) + [G_\theta^k]^{-1} [H_\theta^k]^T W_p^k \, [z_p^k - h_p^k(\theta^k(i), v^k(i))] \qquad (10) $$

$$ v^k(i+1) = v^k(i) + [G_v^k]^{-1} [H_v^k]^T W_q^k \, [z_q^k - h_q^k(\theta^k(i), v^k(i))] \qquad (11) $$

where

$$ G_\theta^k = [H_\theta^k]^T W_p^k H_\theta^k, \qquad G_v^k = [H_v^k]^T W_q^k H_v^k $$

are the gain matrices, built from the Jacobian matrices $H_\theta^k$ and $H_v^k$, which are calculated for the
initial conditions and kept constant in the iterative process. At the boundary buses, the
elements $(\theta, v)$ obtained in (10) and (11) must be corrected with a weighted mean of the values
calculated in the neighbouring areas $k$ and $j$ [8], and take the form

$$ \theta(i+1) = \theta^k(i+1) + \Delta\theta^k(i+1) \qquad (12) $$

$$ v(i+1) = v^k(i+1) + \Delta v^k(i+1) \qquad (13) $$

$$ \Delta v_r^k(i+1) = \ldots \left[ v_r^k(i+1) - v_r^j(i+1) \right] \qquad (15) $$

$S_{rr}^k$ and $S_{rr}^j$ are the diagonal elements corresponding to boundary bus $r$ of the inverse gain
matrices of the neighbouring areas $k$ and $j$, respectively.
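
As a hedged illustration of the boundary-bus treatment, the sketch below assumes that the weighted mean is an inverse-variance average in which the diagonal elements S_rr of the inverse gain matrices play the role of variances. The function names and the exact form of the weights are assumptions made for this example; the text only states that a weighted mean of the neighbouring-area values is used.

```python
def boundary_correction(x_k, x_j, s_k, s_j):
    """Correction added to area k's estimate at a shared boundary bus.

    Assumed form: inverse-variance weighting, with s_k and s_j standing for
    the diagonal elements S_rr of the inverse gain matrices of areas k and j.
    """
    return s_k / (s_k + s_j) * (x_j - x_k)


def reconcile(x_k, x_j, s_k, s_j):
    """Common boundary value: area k's estimate plus its correction."""
    return x_k + boundary_correction(x_k, x_j, s_k, s_j)


# Hypothetical numbers: area k is three times more certain than area j,
# so the common value stays closer to area k's estimate.
v_boundary = reconcile(1.00, 1.10, 0.01, 0.03)
```

Under this assumption the result does not depend on which of the two areas is taken as the reference, which is the property one would want from a shared boundary value.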

4.  Analysis  of  Computation  Experiments 

The Parallel and Distributed State Estimation methodology analysed in this paper was tested
using Fortran 77 and PVM (Parallel Virtual Machine) software, on a machine with the Ultrix
operating system. The
distributed computer system, connected in a network, used in practice for parallel or distributed
area processing, was simulated by means of the PVM facilities [6], which provide, among other
things, message-passing between tasks and task synchronisation. The convergence, accuracy and
numerical efficiency of the proposed simulation
study are presented in the following sections.

4.1  Parallel  Processing  in  the  Integral  Version 

The algorithm implemented for this integral version is represented in figure 1. The decoupled
nature of equations (10) and (11) makes the algorithm suitable for parallel implementation.
The algorithm presented in the flowchart calculates the θ and v updates at every iteration in a
synchronous way. The IEEE 30, 57 and 118 bus standard networks were used to perform this
test. Two levels of global redundancy were specified for each measurement system: normal and
low. Table 1 shows the data for each test case. In this table J is the sum of squared errors
in the estimates of measured variables.


[Flowcharts. Recoverable labels: Acquisition Data, Network Power Information; Parallel Processing; Processing State Estimation Area 1, ..., Area K; Information of Boundary Buses; Store and Display Results; End.]

Fig. 1. Parallel Processing. Integral version.

Fig. 2. Parallel Processing. Distributed version.

All test simulations converge in 2 iterations for the standard WLS method and 8 iterations for

0.00 pu and 0.001 rad for voltage magnitude and phase, respectively. From figure 3 we can see that the
Parallel Processing in the integral version is not as accurate as the MDE method. In a
synchronous process, due to idle times, the algorithm has to wait until the state vector is
updated before it starts a new iteration. If we run the above algorithm in an asynchronous way,
the precision of the state estimation vector will deteriorate drastically.

Table 1: Test Case Data

Test   No. of   Redun-        WLS              MDE           P.Process.
Case   Buses    dancy     Time     J       Time     J       Time     J
A1      30      1.4       0.39    30.5     0.24    32.7     0.57    32.7
A2      30      2.4       1.00    95.0     0.34    97.0     0.69    97.0
B1      57                5.40    91.5     1.60   100       1.60   100
B2      57                9.00   172       2.70   185       2.70   185
C1     118              115      419      30.0    425      34      425
C2     118              220      802      70.0    806      74      806

4.2  Parallel  Processing  in  the  Distributed  Version 

Synchronous computation becomes too expensive when the processors are geographically
distributed [7], so asynchronous concurrent processing is an attractive alternative. We
analysed this by dividing the test cases presented in table 1 into areas and processing
equations (10) and (11) for each area, as shown in the flowchart of figure 2. For the boundary
buses, at the end of the asynchronous iterative process, we applied the restrictions (12) and (13).
The convergence obtained for 0.001 pu and 0.001 rad, the processing time, accuracy and
numerical efficiency are shown in table 2 for the WLS version and in table 3 for the MDE version. The
results demonstrate that in parallel distributed state estimation we can get a
large reduction of processing time, for essentially the same number of iterations, compared
with the integral methods shown in table 1. The accuracy of the results is generally better for cases
with more measurement redundancy and for the WLS state estimation version. The
improvement in processing time for the MDE method compensates for the small degradation of the
results compared with the WLS version. Figure 3 shows the performance of Parallel Processing in
the Distributed Version (PPD), applied to test case C2 (118 buses), comparing the
processing time for the standard WLS and MDE state estimation methods and the Parallel
Processing in the Integral (PPI) and Distributed versions.


Table 2: Parallel Processing of Distributed Areas.
Estimation accuracy for WLS version.

Test   No. of        Time   Average error in      Average error in      J
Case   Iter          (s)    phase angles          voltage magnitude
                            (rad*1000)            (pu*1000)
A1     5; 8          0.11   1.88-10.9             1.31-1.79             14.5-19.1
A2     5; 7          0.13   1.49-6.53             0.99-1.01             39.0-57.6
B1     6; 8          0.37   1.08-2.1              1.45-2.07             38.3-49.1
B2     5; 8          0.54   1.02-1.9              1.0-0.78              81-95
C1     8; 9; 11; 6   2.00   0.56-2.37-0.30-1.55   0.81-0.97-0.82-0.96   63-75-114-146
C2     5; 6; 11; 5   4.00   0.41-1.85-0.18-0.31   0.95-0.96-0.94-1.08   167-161-228-335

Table 3: Parallel Processing of Distributed Areas.
Estimation accuracy for MDE version.

Test   No. of        Time    Average error in      Average error in      J
Case   Iter          (s)     phase angles          voltage magnitude
                             (rad*1000)            (pu*1000)
A1     2; 4          [?]     1.92-8.27             1.26-1.98             12.8-17.7
A2     2; 4          [?]     1.33-5.13             0.99-1.26             35.8-57.8
B1     2; 3          1.15    0.92-1.90             1.30-2.00             35.5-50.1
B2     2; 3          1.70    0.88-1.70             0.95-1.06             71-86
C1     2; 2; 2; 4    [?]     0.59-1.69-0.35-1.02   0.81-0.95-0.81-1.29
C2     2; 2; 2; 3    15.00   0.36-0.84-0.18-0.3    1.02-0.95-0.96-0.3    152-143-226-290

Fig. 3. Parallel Processing Improvement.


5.  Conclusions 

In this paper some methodologies for parallel state estimation were introduced and tested,
based on conventional algorithms such as the standard WLS version and the standard decoupled MDE
version. The results of the computational experiments show that, for integral processing of state
estimation, parallelising the algorithms does not bring any improvement compared with the
conventional decoupled MDE algorithm. Distributed computing is the better way to adopt
parallel computing in power systems. This was simulated by tearing the IEEE standard
test cases into areas. The PVM software tool enables the simulation of distributed tasks on
various processors. Because of processor idle times, synchronous computations become too
expensive when the processors are geographically distributed, so we tested asynchronous
processing. For boundary buses, we apply the restrictions indicated in (12) and (13). The
computational results show that with these distributed methods we get a very large improvement
in processing time compared with the integral standard version. The only drawback is
the discrepancy in the values of boundary bus state variables estimated using different sets of
measurements, but in cases with higher redundancy levels the discrepancies are
acceptable and the effect on computational efficiency is minimal.

Abstract. Linear systems of the form Ax = b, where the matrix A
is symmetric and positive definite, often arise from the discretization of
elliptic partial differential equations. A very successful method for solving
these linear systems is the preconditioned conjugate gradient method.

In this paper we study parallel preconditioners for the conjugate gradient
method based on the block two-stage iterative methods. Sufficient
conditions for the validity of these preconditioners are given. Computational
results of these preconditioned conjugate gradient methods on two
parallel computing systems are presented.


1  Introduction 

We  study  the  parallel  solution  of  a  linear  system 

Ax  =  b,  (1) 

where $A \in \mathbb{R}^{n \times n}$ is a symmetric and positive definite matrix (i.e., $A = A^T$ and
$x^T A x > 0$ for all real $x \neq 0$) and $x$ and $b$ are $n$-vectors.

Preconditioned conjugate gradient (PCG) methods can be used for the solution
of (1). Descriptions of these methods can be found, e.g., in Concus, Golub
and O'Leary [3] or Ortega [9]. The idea of the PCG method consists of applying
the conjugate gradient method (see [5]) to a better conditioned linear system
$\hat{A}\hat{x} = \hat{b}$, where $\hat{A} = S A S^T$, $\hat{x} = S^{-T} x$, and $\hat{b} = S b$. The matrix $M = (S^T S)^{-1}$
is called the preconditioner or preconditioning matrix. The PCG method may
be applied without computing $\hat{A}$, but solving the auxiliary system

$$ M s = r, \qquad (2) $$

at each conjugate gradient iteration, where $r = b - Ax$ is the residual at the
corresponding iteration.
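
To make the role of the auxiliary system (2) concrete, here is a minimal sketch of a PCG loop in plain Python. It is a textbook formulation, not the code used in the paper; the names `pcg`, `matvec` and `solve_M`, the small test matrix and the diagonal preconditioner are assumptions chosen for illustration.

```python
def pcg(matvec, solve_M, b, tol=1e-10, max_iter=200):
    """Preconditioned conjugate gradient for an SPD system A x = b.

    matvec(v) returns A v; solve_M(r) returns s with M s = r.
    Returns the approximate solution and the number of iterations used.
    """
    x = [0.0] * len(b)
    r = b[:]                                   # r = b - A x with x = 0
    s = solve_M(r)                             # auxiliary system M s = r
    p = s[:]
    rho = sum(ri * si for ri, si in zip(r, s))
    for it in range(1, max_iter + 1):
        q = matvec(p)
        alpha = rho / sum(pi * qi for pi, qi in zip(p, q))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        if max(abs(ri) for ri in r) < tol:
            return x, it
        s = solve_M(r)                         # one preconditioner solve per step
        rho_new = sum(ri * si for ri, si in zip(r, s))
        p = [si + (rho_new / rho) * pi for si, pi in zip(s, p)]
        rho = rho_new
    return x, max_iter


# Usage on the small SPD matrix tridiag[-1, 2, -1] with M = diag(A).
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(5)]
     for i in range(5)]
matvec = lambda v: [sum(a * vk for a, vk in zip(row, v)) for row in A]
solve_M = lambda r: [ri / A[i][i] for i, ri in enumerate(r)]
x, iterations = pcg(matvec, solve_M, [1.0] * 5)
```

Only `solve_M` changes when a different preconditioner is used, which is why the rest of the paper concentrates on how to solve (2) cheaply and in parallel.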

One of the general preconditioning techniques is the use of truncated
series preconditioning. These preconditioners consist of considering a splitting
of the matrix $A$ as

$$ A = P - Q, \qquad (3) $$

and performing $m$ steps of the iterative procedure defined by the splitting
toward the solution of $As = r$, choosing $s^{(0)} = 0$. It is well known that the
solution of the auxiliary system (2) is effected by $s = (I + R + R^2 + \cdots + R^{m-1}) P^{-1} r$,
where $R = P^{-1} Q$, and the preconditioning matrix is $M_m = P (I + R + R^2 + \cdots + R^{m-1})^{-1}$.

It is in these terms that, in Section 2, we construct the preconditioner based
on the two-stage methods and study its validity. Moreover, in Section 3
we evaluate the performance of the resulting PCG algorithms on two different
parallel distributed memory multiprocessors.

2  Parallel  block  two-stage  preconditioners 

Let us consider the splitting (3), where $P$ is a block diagonal matrix, denoted

$$ P = \mathrm{Diag}(P_1, \ldots, P_p), \qquad (4) $$

and $P_j$, $1 \le j \le p$, are square nonsingular matrices of order $n_j$, $\sum_{j=1}^{p} n_j = n$.

Note that performing $m$ steps of the iterative procedure defined by the above
splitting to approximate the solution of $As = r$ corresponds to performing $m$ steps
of a Block-Jacobi type method. Thus, at each step $l$, $l = 1, 2, \ldots, m$, of a
Block-Jacobi type method, $p$ independent linear systems of the form

$$ P_j s_j^{(l)} = (Q s^{(l-1)} + r)_j, \quad 1 \le j \le p, \qquad (5) $$

with $s^{(0)} = 0$, need to be solved; therefore each linear system (5) can be solved
by a different processor. However, when the order of the diagonal blocks $P_j$,
$1 \le j \le p$, is large, it is natural to approximate their solutions by using an iterative
procedure, and thus we are in the presence of a two-stage iterative method; see
e.g., [4], [6], [7], [8]. In a formal way, let us consider the splittings

$$ P_j = B_j - C_j, \quad 1 \le j \le p, \qquad (6) $$

and at each $l$th step perform, for each $j$, $1 \le j \le p$, $q(j)$ iterations of the iterative
procedure defined by the splittings (6) in order to approximate the solution of
(5). Thus, to solve the auxiliary system (2) of the PCG method, we use $m$
steps of the iteration

$$ s^{(l)} = T s^{(l-1)} + W^{-1} r, \quad l = 1, 2, \ldots, m, $$

choosing $s^{(0)} = 0$, where $T = H + (I - H) P^{-1} Q$ and $W = P (I - H)^{-1}$, with $P$ defined
in (4) and $H = \mathrm{Diag}((B_1^{-1} C_1)^{q(1)}, \ldots, (B_p^{-1} C_p)^{q(p)})$; see e.g., [7]. Then, the
updated vector after $m$ steps is given by $s^{(m)} = (I + T + T^2 + \cdots + T^{m-1}) W^{-1} r$.

Therefore, the preconditioner related to the block two-stage methods is given by

$$ M_m = W (I + T + T^2 + \cdots + T^{m-1})^{-1}. \qquad (7) $$
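
In practice the preconditioner is applied procedurally rather than by forming (7). The sketch below, in plain Python, performs m outer Block-Jacobi steps and approximates each block solve with q(j) inner point-Jacobi iterations. It is an illustration under assumptions: P is taken as the plain block diagonal of A (no added diagonal matrices), the inner splitting uses the diagonal of each block, and the name `two_stage_apply` and the small test problem are hypothetical.

```python
def two_stage_apply(A, r, blocks, q, m):
    """Return s^(m), an approximation of the solution of A s = r.

    blocks : list of index lists, one per diagonal block P_j
    q      : q[j] inner Jacobi iterations for block j
    m      : number of outer (Block-Jacobi type) steps
    """
    n = len(r)
    s = [0.0] * n                                  # s^(0) = 0
    for _outer in range(m):
        new_s = s[:]
        for j, idx in enumerate(blocks):
            inside = set(idx)
            # (Q s + r)_j, where Q = P - A holds the off-block couplings
            rhs = {i: r[i] - sum(A[i][k] * s[k]
                                 for k in range(n) if k not in inside)
                   for i in idx}
            y = {i: s[i] for i in idx}             # start from s_j^(l-1)
            for _inner in range(q[j]):
                # inner splitting P_j = B_j - C_j with B_j = diag(P_j)
                y = {i: (rhs[i] - sum(A[i][k] * y[k] for k in idx if k != i))
                        / A[i][i]
                     for i in idx}
            for i in idx:
                new_s[i] = y[i]
        s = new_s
    return s


# Hypothetical check on tridiag[-1, 2, -1] of order 4 split into two blocks.
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(4)]
     for i in range(4)]
r = [1.0, 2.0, 3.0, 4.0]
s = two_stage_apply(A, r, blocks=[[0, 1], [2, 3]], q=[3, 3], m=300)
```

Each block in the loop over `blocks` is independent of the others within one outer step, so in a parallel setting each one would be assigned to a different processor. With a small m, as used for preconditioning, `s` is only a rough approximation; the large m in the example merely confirms that the iteration converges to the solution of A s = r.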

In the rest of this section we check the validity of this preconditioner. We
give sufficient conditions on the splittings to ensure that $M_m$ is symmetric and
positive definite. Given a square real matrix $A$, the splitting $A = P - Q$ is
P-regular if and only if $P^T + Q$ is positive definite.

Theorem 1. Let $A$ be a symmetric positive definite matrix. Let $A = P - Q$
be a splitting of $A$, where $P = \mathrm{Diag}(P_1, \ldots, P_p)$ is the block diagonal matrix
defined in (4). Suppose that $P$ is symmetric and $Q$ is positive semidefinite. Let
$P_j = B_j - C_j$, $1 \le j \le p$, be P-regular splittings such that $B_j$ is symmetric.
Then the preconditioning matrix $M_m$ defined by (7) is symmetric.

Proof. The matrix $W^{-1} = (I - H) P^{-1}$ can be written as

$$ W^{-1} = \mathrm{Diag}\big( (I - (B_1^{-1} C_1)^{q(1)}) P_1^{-1}, \ldots, (I - (B_p^{-1} C_p)^{q(p)}) P_p^{-1} \big)
        = \mathrm{Diag}\Big( \sum_{i=0}^{q(1)-1} (B_1^{-1} C_1)^i B_1^{-1}, \ldots, \sum_{i=0}^{q(p)-1} (B_p^{-1} C_p)^i B_p^{-1} \Big). \qquad (8) $$

Since $P_j$ and $B_j$, $1 \le j \le p$, are symmetric, $C_j$ is also symmetric. Then, it is easy
to see that $W^{-1}$ is symmetric. On the other hand, the matrix $T$ can be written as
$T = I - W^{-1} A$. Then, from (7) we obtain

$$ M_m^{-1} = (I + T + T^2 + \cdots + T^{m-1}) W^{-1} = \sum_{i=0}^{m-1} (I - W^{-1} A)^i W^{-1}. $$

Thus, the matrix $M_m^{-1}$ is a linear combination of terms
of the form $(W^{-1} A)^i W^{-1}$, $i = 0, 1, \ldots, m-1$, which are symmetric. Then, the
proof is completed.

Theorem 2. Let $A$ be a symmetric positive definite matrix. Let $A = P - Q$
be a splitting of $A$, where $P = \mathrm{Diag}(P_1, \ldots, P_p)$ is the block diagonal matrix
defined in (4). Suppose that $P$ is symmetric and $Q$ is positive semidefinite. Let
$P_j = B_j - C_j$, $1 \le j \le p$, be P-regular splittings such that $B_j$ is symmetric.
Then the preconditioning matrix $M_m$ defined by (7) is positive definite.

Proof. Since $P_j = B_j - C_j$, $1 \le j \le p$, are P-regular splittings, from Corollary
3.6 of [2] it follows that the block diagonal matrix $W = P (I - H)^{-1}$ is positive
definite. On the other hand, from (7) we can write

$$ M_m^{-1} W = (I + T + T^2 + \cdots + T^{m-1}), \qquad (9) $$

with $T = I - W^{-1} A$. From Theorem 3.5 of [2] it follows that $\rho(T) < 1$, and
reasoning in a similar way as in the proof of Theorem 3.4.2 of [9] it is obtained
that the eigenvalues of $M_m^{-1} W$ are positive. Then, from Theorem A.2.7 of [9] the
proof is completed.

3  Numerical  experiments 

In the experiments the problem to be solved comes from the discretization of
Laplace's equation, $\Delta u = 0$, satisfying Dirichlet boundary conditions
on the unit square $\Omega = [0, 1] \times [0, 1]$. The discretization of the domain $\Omega$, using
five point finite differences, with $J \times J$ points equally spaced by $h$, yields a linear
system $Ax = b$, where $A$ is block tridiagonal, $A = \mathrm{tridiag}[-I, C, -I]$, where $I$
and $C$ are $J \times J$ matrices, $I$ is the identity, and $C = \mathrm{tridiag}[-1, 4, -1]$. Note
that $A$ has $J \times J$ blocks of size $J \times J$. Clearly, $A$ is a symmetric positive definite
matrix.
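
The block structure just described is easy to check on a small grid. The sketch below builds the dense matrix in plain Python and tests symmetry and positive definiteness with a Cholesky attempt; the helper names `laplace_matrix` and `is_positive_definite` are assumptions for this illustration, and a real experiment would use a sparse format.

```python
def laplace_matrix(J):
    """Dense A = tridiag[-I, C, -I] with C = tridiag[-1, 4, -1], order J*J."""
    n = J * J
    A = [[0.0] * n for _ in range(n)]
    for block in range(J):            # block row, i.e. one grid line
        for i in range(J):            # position inside the block
            k = block * J + i
            A[k][k] = 4.0
            if i > 0:
                A[k][k - 1] = -1.0    # sub-diagonal of C
            if i < J - 1:
                A[k][k + 1] = -1.0    # super-diagonal of C
            if block > 0:
                A[k][k - J] = -1.0    # -I block below the diagonal
            if block < J - 1:
                A[k][k + J] = -1.0    # -I block above the diagonal
    return A


def is_positive_definite(A):
    """For a symmetric A, a Cholesky factorization exists only if A is SPD."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False
                L[i][j] = s ** 0.5
            else:
                L[i][j] = s / L[j][j]
    return True


A = laplace_matrix(3)   # 9 x 9 matrix for a 3 x 3 grid
symmetric = all(A[i][j] == A[j][i] for i in range(9) for j in range(9))
```

Note that entries such as A[2][3] stay zero: the last point of one grid line is not coupled to the first point of the next, which is what distinguishes the block form from a plain banded matrix.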


Let $A = P - Q$ be the Block-Jacobi splitting of $A$, i.e., $P = \mathrm{Diag}(A_{11}, \ldots, A_{pp})$.
Let us consider square diagonal nonnegative matrices $D_j$, of size $n_j$, $1 \le j \le p$,
such that $Q + \mathrm{Diag}(D_1, \ldots, D_p)$ is positive semidefinite. Then, it is easy to see
that the splitting $A = \tilde{P} - \tilde{Q}$, where

$$ \tilde{P} = \mathrm{Diag}(P_1, \ldots, P_p), \quad P_j = A_{jj} + D_j, \quad \tilde{Q} = Q + \mathrm{Diag}(D_1, \ldots, D_p), \qquad (10) $$

satisfies the assumptions of Theorems 1 and 2.

Therefore, in order to ensure the hypotheses of the above theorems we considered
in our examples a block splitting as in (10), where $A_{jj} = \mathrm{tridiag}[-I, C, -I]$,
$1 \le j \le p$, and

$$ D = \mathrm{Diag}(D_1, \ldots, D_p) = \mathrm{Diag}\Big( \sum_{j=1, j \ne 1}^{n} |q_{1j}|, \ldots, \sum_{j=1, j \ne n}^{n} |q_{nj}| \Big), \quad \text{with } Q = [q_{ij}], \ 1 \le i, j \le n. $$

In the experiments reported here, we use the Jacobi method as the inner iterative
procedure.

The parallel experiments have been run on two different parallel computer
systems. The first platform is an IBM RS/6000 SP with 8 nodes. The second
platform is an Ethernet network of five 120 MHz Pentiums. The peak performance
of this network is 100 Mbytes per second.

We experimented with different matrix sizes. The matrices were partitioned
according to the number of available processors. The conclusions were similar
for all matrices. Here we discuss the results for two matrices of size 1024
and 4096, which correspond to grid sizes of 32 and 64, respectively.

The initial vector used was $x^{(0)} = (0, 0, \ldots, 0)^T$ and the right hand side was
$b = (1, 1, \ldots, 1)^T$. The stopping criterion used was • r < 10~®, where r is
the residual at the corresponding iteration. All times are reported in seconds.
In the results we use the notation 2^6^ to represent that $q(j) = 2$, $j = 1$, and
$q(j) = 6$, $j = 2$. Similar notation is used for other block two-stage PCG methods.

Tables 1 and 2 show the behavior of some PCG methods for the above Laplace
matrices. We compare these methods with the well-known m-step Block-Jacobi
PCG method, which has potentially excellent parallel properties. In this case the
subdomain problems are solved by using the complete Cholesky factorization (see
e.g. [9]). One can observe that the use of two-stage preconditioners gives better
results than the use of the Block-Jacobi preconditioner. The conclusions are
similar on both multiprocessors. However, the computing platform obviously has
an influence on the performance of a parallel implementation: the efficiency
decreases notably when the number of processors increases. This is due
to the inadequate use of the processors when the number of processors increases
for a fixed matrix, because the cost of the operations performed in parallel can
be smaller than the cost of communications. For example, in the last block
partitioning of Table 2, using four processors on the cluster of Pentiums, we obtain
REAL times between 3.04 and 7.21 seconds, whereas the CPU times are between
0.68 and 1.54 seconds. Here the network is very slow compared to the network
in the other computing platform.

On the other hand, we observed that generally the optimal number of steps
$m$ is two for any size of the diagonal blocks. However, it seems that the choice
of the number of inner iterations $q(j)$ depends on the size of the diagonal
blocks. So, a good sequence of inner iterations is one in which each $q(j)$ is a little
greater than one and which a priori balances the load according to the block size
assigned to each processor.

We have observed, in some cases, that when the number of steps is odd,
the number of iterations increases with respect to the previous even number of
steps. This is due to the condition number of the matrix $\hat{A} = S A S^T$, which
is similar to the matrix $M_m^{-1} A$. Then, $\mathrm{cond}(\hat{A}) = \dfrac{1 - \lambda_{\min}(T^m)}{1 - \lambda_{\max}(T^m)}$, where $\lambda_{\min}(T^m)$
and $\lambda_{\max}(T^m)$ are respectively the minimum and maximum eigenvalues of $T^m$.
Therefore, if $T$ has negative eigenvalues and $m$ is odd, the numerator of $\mathrm{cond}(\hat{A})$
is greater than one. However, if $m$ is even, the numerator is always less than one.
Thus, we must expect a better decrease of $\mathrm{cond}(\hat{A})$ for even values of $m$.
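
The expression for the condition number can be traced back to (7) and to the identity $T = I - W^{-1}A$ used in the proofs of Section 2. The short derivation below is a sketch that assumes the hypotheses of Theorems 1 and 2, so that the eigenvalues of $T$ are real and lie in $(-1, 1)$:

```latex
\begin{aligned}
M_m^{-1} A &= (I + T + \cdots + T^{m-1})\, W^{-1} A \\
           &= (I + T + \cdots + T^{m-1})(I - T) \;=\; I - T^m, \\
\lambda(M_m^{-1} A) &= 1 - \lambda(T)^m, \\
\mathrm{cond}(\hat{A}) &= \frac{\lambda_{\max}(M_m^{-1} A)}{\lambda_{\min}(M_m^{-1} A)}
   \;=\; \frac{1 - \lambda_{\min}(T^m)}{1 - \lambda_{\max}(T^m)} .
\end{aligned}
```

For even $m$ every eigenvalue of $T^m$ is nonnegative, so the numerator cannot exceed one; for odd $m$ a negative eigenvalue of $T$ gives a negative eigenvalue of $T^m$ and pushes the numerator above one.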


Table 1. Parallel implementation of the PCG method on the solution of Laplace
problems. Size of matrix A: 1024.

Columns: # Proc.; Block two-stage PCG (q(j), It., Time cluster, Time SP2); Block-Jacobi PCG (It., Time cluster, Time SP2).

Cell values in reading order ([?] marks an illegible cell):
[?]; [?]; 1^; [?]; 0.090; [?]; 27; 0.51; [?]; [?]; [?]; 31; 0.61; [?]; [?]; [?]; B 9; 21; 0.44; [?];
8; 2; B 9; 25; 0.57; 0.056; [?]; 2; B 9; 14; 0.43; 0.051; [?]; 12; 0.41; 0.050;
[?]; [?]; II 2; 59; 1.13; [?]; 2*6’; 30; 0.58; 11; 2.64; 1.73; 2; 2"; 23; 0.58; 0.066; [?];
2; 3*6*; 19; 0.54; 0.066; 6; 2.42; 1.71; 3; 11; 50; [?]; 0.128; 352; [?]; 4"; 20; 0.49; 11; 1.17;
9^; 352; 26; 0.084; 320; 2; 5^; 13; 0.39; [?]; 8; 0.17; 3; 28; 0.80; 3; 2^; 16; 3; 3®; 17; 0.56;
3; 4^; 12; 0.44; [?]; 0.97; 0.18

Table 2. Parallel implementation of the PCG method on the solution of Laplace
problems. Size of matrix A: 4096.

Column groups: # Proc.; Block two-stage PCG; Block-Jacobi PCG.

Cell values in reading order ([?] marks an illegible cell):
[?]; [?]; 102; 5.93; 1; 47; 3.18; [?]; 19; 16.47; 9.01; 1; 52; 4.12; 0.36; [?]; [?]; 32; 3.02; 0.32;
11; 16.40; 8.06; [?]; 1; 1; 101; 7.21; 1; 1; 38; 3.04; 21; 9.58; 4.01; 2; 1; 51; 4.95; 0.41;
2; 1; 27; 3.12; 0.27; 14; 9.72; [?]; [?]; 22; 3.36; 0.29; 13; 10.28

References


1. Adams, L.: M-step preconditioned conjugate gradient methods. SIAM Journal on
Scientific and Statistical Computing, Vol. 6 (1985) 452-462

2. Castel, M. J., Migallon, V., Penades, J.: Parallel two-stage iterative methods for
hermitian positive definite matrices. Technical Report 96-03, Departamento de
Tecnologia Informatica y Computacion, Universidad de Alicante, Spain (1996)

3. Concus, P., Golub, G. H., O'Leary, D. P.: A generalized conjugate gradient method
for the numerical solution of elliptic partial differential equations. In: Bunch, J. R.,
Rose, D. J. (eds.): Sparse Matrix Computations. Academic Press (1976) 309-332

4. Frommer, A., Szyld, D. B.: H-splittings and two-stage iterative methods.
Numerische Mathematik, Vol. 63 (1992) 345-356

5. Hestenes, M. R., Stiefel, E.: Methods of conjugate gradients for solving linear
systems. J. of Res. Nat. Bureau Standards, Vol. 49 (1952) 409-436

6. Lanzkron, P. J., Rose, D. J., Szyld, D. B.: Convergence of nested iterative methods
for linear systems. Numerische Mathematik, Vol. 58 (1991) 685-702

7. Migallon, V., Penades, J.: Convergence of two-stage iterative methods for hermitian
positive definite matrices. Applied Mathematics Letters, Vol. 10(3) (1997) 79-83

8. Nichols, N. K.: On the convergence of two-stage iterative processes for solving linear
equations. SIAM Journal on Numerical Analysis, Vol. 10 (1973) 460-469

9. Ortega, J. M.: Introduction to Parallel and Vector Solution of Linear Systems.
Plenum Press, New York (1988)




VECPAR  ’98  -  3rd  International  Meeting  on  Vector  and  Parallel  Processing 


Parallelization  of  a  Direct  Method  for  Systems 
of  Linear  Equations 

M. F. Costa¹ and R. M. Ralha²

¹ Departamento de Matematica, Universidade do Minho,
Campus de Azurem, 4800 Guimaraes, Portugal
mfc@math.uminho.pt

² Departamento de Matematica, Universidade do Minho,
Campus de Gualtar, 4710 Braga, Portugal

Abstract. In this paper we study a sequential version of the Gaussian
elimination method in which several pivots are used in each reduction
step. We carry out an error analysis and establish an upper bound for
the error in the solution. In all our tests (in which we have used random
matrices as well as matrices of special types) the numerical results
produced by an implementation of the algorithm are as good as those
produced by the classical method. From the point of view of sequential
processing, the new method is as efficient as the classical method and we
believe that it has advantages for parallel processing since it allows better
load balancing and computation/communication overlap. We develop
a parallel implementation of the new method in a distributed memory
system with a ring topology and give a performance analysis of the parallel
algorithm based on the study of the load balancing and the cost
of communication between processors. We present preliminary results of
some computational experiments with the parallel algorithm.

1  Introduction 

Much work has been published in recent years on the parallel solution of large
systems of linear equations. A considerable number of publications treat the
parallelization of the old method of Gauss with partial pivoting [3] [5] [6] [7] [9]
[11] [12] [14] [15]. The main problem of any implementation of this method on a
multiprocessor machine resides in the need to incorporate partial pivoting to
guarantee the numerical stability of the method: at each step, the search for the
pivotal row forces the synchronization of the activity of several processors, and
part of the time is spent on communication and waiting. To minimize these
problems, we propose a modification of the method of Gaussian elimination which
consists in the use of several pivots in each reduction step; we first study a
sequential version of the modified method and then proceed with its
parallelization. Our proposal is significantly different from another variant of the
method known as “pairwise pivoting”, which was introduced by Wilkinson [1]
and more recently used by others in the context of parallel processing [3] [5].
As is also the case with pairwise pivoting [2] [4], one possible drawback of our
pivoting strategy is that the theoretical upper bound for the error in the solution
is larger than in the classical method; nevertheless, in our numerical experiments
the errors produced by both methods were found to be comparable.


2  Gaussian  elimination  with  several  pivots  in  each  step 

Given a system $Ax = b$ with $A \in \mathbb{R}^{n \times n}$ nonsingular, consider the
matrix $(A|b)$ divided into $nB$ blocks of $R$ contiguous rows. In the process of
reducing $A$ to triangular form, we consider the $k$th reduction step
($k = 1, 2, \ldots, n-1$) as a sequence of two phases. The first phase occurs at an
internal level within each block and the second phase involves the various blocks.


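$$
\left(\begin{array}{ccccc}
a_{1,1} & a_{1,2} & \cdots & a_{1,n} & b_1 \\
\vdots & \vdots & & \vdots & \vdots \\
a_{R,1} & a_{R,2} & \cdots & a_{R,n} & b_R \\ \hline
a_{R+1,1} & a_{R+1,2} & \cdots & a_{R+1,n} & b_{R+1} \\
\vdots & \vdots & & \vdots & \vdots \\
a_{2R,1} & a_{2R,2} & \cdots & a_{2R,n} & b_{2R} \\ \hline
\vdots & \vdots & & \vdots & \vdots \\ \hline
a_{(n-R)+1,1} & a_{(n-R)+1,2} & \cdots & a_{(n-R)+1,n} & b_{(n-R)+1} \\
\vdots & \vdots & & \vdots & \vdots \\
a_{n,1} & a_{n,2} & \cdots & a_{n,n} & b_n
\end{array}\right)
$$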

Description of the first step of reduction: for $L = 1, 2, \ldots, nB$ select a pivotal
row, called local pivotal row, let us say row $p_L$, where

$$|a_{p_L,1}| = \max_{(L-1)R+1 \le i \le LR} |a_{i,1}|.$$

Next, if $a_{p_L,1} \neq 0$, each row $i$ ($i = (L-1)R+1, \ldots, LR$, $i \neq p_L$) is
replaced by its sum with row $p_L$ multiplied by $m_{i,1} = -a_{i,1}/a_{p_L,1}$.

Once these elementary operations are concluded in each block, one still needs
to annihilate $nB - 1$ elements in the first column. To do this, a global pivotal row
is selected among the $nB$ local pivotal rows, which is row $p$, where

$$|a_{p,1}| = \max_{1 \le L \le nB} |a_{p_L,1}|.$$

Assuming that $a_{p,1} \neq 0$, we finalize the first step of reduction by replacing each
of the remaining local pivotal rows with its sum with the global pivotal row multiplied
by $m_{p_L,1} = -a_{p_L,1}/a_{p,1}$ ($L = 1, \ldots, nB$ and $p_L \neq p$), where one interchanges
rows $p$ and 1 if $p \neq 1$, so that in the end the matrix of the system is in triangular
form. In the remaining $n - 2$ reduction steps one proceeds in an analogous way.
Note that initially the number of local pivotal rows equals the number $nB$ of
blocks, but this number decreases along the process of elimination, as the
number of blocks involved in the reduction to triangular form decreases.
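
The reduction described in this section can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, the local and global pivots are assumed to be chosen by largest modulus (ties resolved in favour of the first candidate), and the test matrix is the $6 \times 6$ example used later in the paper with $R = 3$ and a right-hand side chosen so that the exact solution is $x_i = 1$.

```python
def add_multiple(aug, i, p, k):
    """Replace row i by row i + m * row p, with m = -a[i][k] / a[p][k]."""
    m = -aug[i][k] / aug[p][k]
    aug[i] = [x + m * y for x, y in zip(aug[i], aug[p])]
    aug[i][k] = 0.0  # annihilated by construction


def eliminate_multi_pivot(aug, R):
    """Reduce the augmented matrix (A|b) to upper triangular form in place,
    using one local pivot per block of R rows and one global pivot per step."""
    n = len(aug)
    n_blocks = (n + R - 1) // R
    for k in range(n - 1):
        local_pivots = []
        for L in range(n_blocks):
            lo, hi = max(k, L * R), min((L + 1) * R, n)
            if lo >= hi:
                continue  # all rows of this block have already been used
            # Local phase: pivot of largest modulus inside the block.
            p_L = max(range(lo, hi), key=lambda i: abs(aug[i][k]))
            if aug[p_L][k] != 0.0:
                for i in range(lo, hi):
                    if i != p_L:
                        add_multiple(aug, i, p_L, k)
            local_pivots.append(p_L)
        # Global phase: the best local pivot annihilates the other ones.
        p = max(local_pivots, key=lambda i: abs(aug[i][k]))
        if aug[p][k] == 0.0:
            raise ValueError("matrix is singular")
        for q in local_pivots:
            if q != p:
                add_multiple(aug, q, p, k)
        aug[k], aug[p] = aug[p], aug[k]  # interchange rows p and k
    return aug


def back_substitute(aug):
    """Solve the upper triangular system stored in (U|c)."""
    n = len(aug)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


A = [[ 1, -1, -1, -1, -1, -1],
     [-1,  2,  0,  0,  0,  0],
     [-1,  0,  3,  1,  1,  1],
     [-1,  0,  1,  4,  2,  2],
     [-1,  0,  1,  2,  5,  3],
     [-1,  0,  1,  2,  3,  6]]
# Right-hand side b = A * (1, ..., 1), so the exact solution is all ones.
aug = [[float(v) for v in row] + [float(sum(row))] for row in A]
eliminate_multi_pivot(aug, R=3)
U = [row[:6] for row in aug]
x = back_substitute(aug)
```

With these assumptions the sketch reproduces the triangular factor $U$ of the paper's example; a different tie-breaking rule could select other pivotal rows and hence a different, equally valid, $U$.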


3  Matrix  formulation  of  the  method 

A matrix formulation of the method with several pivots in each reduction step
can be described in terms of products with non-unitary elementary matrices
(Gauss transformations) [18]. Denoting the matrices involved in the local and
global stages of the $k$th step by $M_{k,L}$ and $M_k$, respectively, we have:

$$M_{k,L} = I - m_{k,L}\, e_{k_L}^T$$

$$M_k = I - m_k\, e_k^T$$

where:

- $k_L$ is the index of the local pivotal row in block $L$

- $m_{k,L}$ represents the vector of multipliers used in the local stage, in block
  $L$ ($m_{i,k} = a_{i,k}^{(k-1)} / a_{k_L,k}^{(k-1)}$, $i = \ldots, LR$)

- $e_{k_L}$ is the $k_L$th column of the identity matrix

- $k$ is the index of the global pivotal row

- $m_k$ represents the vector of multipliers used in the global phase

We will also denote the elementary permutation matrices by $P_{k,L}$ and $P_k$
when referring to permutation of rows in the local phase (i.e., interchange of two
local rows in block $L$) and in the global phase (i.e., permutation of rows from
two distinct blocks), respectively. Therefore, at the end of step $n - 1$ we have a
triangular matrix $U$ given by

$$
\underbrace{M_{n-1}P_{n-1}}_{\text{step } (n-1)} \cdots
\underbrace{M_{(n-R)+1}P_{(n-R)+1}}_{\text{step } (n-R)+1}\,
\underbrace{M_{n-R}P_{n-R}M_{n-R,nB}P_{n-R,nB}}_{\text{step } (n-R)} \cdots
\underbrace{M_{R}P_{R}M_{R,nB}P_{R,nB} \cdots}_{\text{step } R} \cdots
\underbrace{M_{1}P_{1}M_{1,nB}P_{1,nB} \cdots M_{1,1}P_{1,1}}_{\text{step } 1}\, A = U.
$$

In terms of factorization, we have $A = LU$ where

$$L = P_{1,1} M_{1,1}^{-1} \cdots P_{n-1} M_{n-1}^{-1}$$

is not necessarily a lower triangular matrix. However, if $L$ is required for practical
purposes, it can be readily obtained as a product of simple matrices, according
to the previous expression.
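
To make the role of the Gauss transformations concrete, the following Python sketch builds one matrix of the form $M = I - m\,e_p^T$ and its inverse $M^{-1} = I + m\,e_p^T$ in exact rational arithmetic. It is a minimal illustration under stated assumptions: the helper names are hypothetical, the pivot is simply taken to be the first row of the block, and the matrix is the $6 \times 6$ example of this section with the local phase applied to its first block of three rows.

```python
from fractions import Fraction


def gauss_transformation(column, pivot, rows):
    """Return M = I - m e_pivot^T, where m_i = column[i] / column[pivot]
    for the indices i in `rows` (i != pivot) and m_i = 0 elsewhere."""
    n = len(column)
    M = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for i in rows:
        if i != pivot:
            M[i][pivot] = -Fraction(column[i], column[pivot])
    return M


def inverse_gauss_transformation(M, pivot):
    """The inverse I + m e_pivot^T only flips the sign of the multipliers."""
    inv = [row[:] for row in M]
    for i in range(len(M)):
        if i != pivot:
            inv[i][pivot] = -M[i][pivot]
    return inv


def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


A = [[ 1, -1, -1, -1, -1, -1],
     [-1,  2,  0,  0,  0,  0],
     [-1,  0,  3,  1,  1,  1],
     [-1,  0,  1,  4,  2,  2],
     [-1,  0,  1,  2,  5,  3],
     [-1,  0,  1,  2,  3,  6]]

# Local stage of step 1 in the first block (rows 0..2), pivot row 0.
first_column = [row[0] for row in A]
M = gauss_transformation(first_column, pivot=0, rows=range(0, 3))
MA = matmul(M, A)
M_inv = inverse_gauss_transformation(M, pivot=0)
identity = [[Fraction(int(i == j)) for j in range(6)] for i in range(6)]
```

Applying `M` annihilates the first-column entries of rows 1 and 2 and leaves the second block untouched, which is why $L$ can be accumulated as a product of such simple inverses.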

Example ($n = 6$, $nB = 2$):

$$
A = \begin{pmatrix}
 1 & -1 & -1 & -1 & -1 & -1 \\
-1 &  2 &  0 &  0 &  0 &  0 \\
-1 &  0 &  3 &  1 &  1 &  1 \\
-1 &  0 &  1 &  4 &  2 &  2 \\
-1 &  0 &  1 &  2 &  5 &  3 \\
-1 &  0 &  1 &  2 &  3 &  6
\end{pmatrix}
$$

4  Error  analysis 

A detailed error analysis for the new method is given in [18], where it is shown
that the computed solution $\hat{x}$ satisfies the system

$$(A + E)\hat{x} = b$$

with

$$\|E\|_\infty \le 15 n^3 \rho \|A\|_\infty u + O(u^2),$$

where $u$ denotes the unit roundoff. In the Gauss elimination method with partial
pivoting one has [10]:

$$\|E\|_\infty \le 8 n^3 \rho \|A\|_\infty u + O(u^2).$$

Therefore, the bound for the rounding errors in the new method is more
pessimistic because of the factor $15n^3$. At this point, one should bear in mind that
such a factor is usually ignored in the discussion of the stability of Gaussian
elimination. As stated in [10], p. 65: “...usually, the bound itself is weaker than
it might have been because of the necessity of restricting the mass of detail to a
reasonable level and because of limitations imposed by expressing the errors in
terms of matrix norms”. It is usually considered that the numerical stability of
the method depends on the size of a growth factor $\rho$. We adopted the definition

$$\rho = \frac{\max_{i,j,k} |a_{ij}^{(k)}|}{\max_{i,j} |a_{ij}|}$$

given in [16]. Although, in theory, $\rho$ can be as large as $2^{n-1}$, in practice such
growth is extremely improbable and $\rho$ is generally of the order 10. Indeed, in
the computational experiments carried out with both methods, we found $\rho$ to be
always of such order of magnitude (see table 1). Based on this, we claim that the
numerical properties of the new method are comparable to those of the classical
algorithm.
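
The growth factor can be monitored directly during the elimination. The Python sketch below does this for classical partial pivoting, assuming the definition of $\rho$ as the largest modulus of any intermediate entry divided by the largest modulus in the original matrix; the function names are hypothetical. It is applied to the well-known matrix with ones on the diagonal and in the last column and $-1$ below the diagonal, for which partial pivoting attains the theoretical growth $2^{n-1}$.

```python
def growth_factor_partial_pivoting(A):
    """Return max_{i,j,k} |a_ij^(k)| / max_{i,j} |a_ij| for Gaussian
    elimination with partial pivoting applied to a copy of A."""
    A = [[float(v) for v in row] for row in A]
    n = len(A)
    max_initial = max(abs(v) for row in A for v in row)
    max_seen = max_initial
    for k in range(n - 1):
        # Partial pivoting: bring the entry of largest modulus to row k.
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[p] = A[p], A[k]
        if A[k][k] == 0.0:
            continue  # nothing to eliminate in this column
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
        max_seen = max(max_seen, max(abs(v) for row in A for v in row))
    return max_seen / max_initial


def worst_case_matrix(n):
    """1 on the diagonal and in the last column, -1 below the diagonal."""
    return [[1.0 if (i == j or j == n - 1) else (-1.0 if i > j else 0.0)
             for j in range(n)] for i in range(n)]


rho = growth_factor_partial_pivoting(worst_case_matrix(6))
```

For $n = 6$ the last column doubles at every step, so the sketch returns $\rho = 2^5 = 32$; for random matrices the same function typically reports far smaller values.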


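For the example of Section 3 ($n = 6$, $nB = 2$), the factors in $A = LU$ are

$$
L = \begin{pmatrix}
 1 &  0 &  0 & 0 & 0 & 0 \\
-1 &  1 &  0 & 0 & 0 & 0 \\
-1 & -1 &  1 & 0 & 0 & 0 \\
-1 & -1 & -1 & -\tfrac{1}{2} & -\tfrac{1}{4} & 1 \\
-1 & -1 & -1 &  \tfrac{1}{2} & -\tfrac{1}{4} & 1 \\
-1 & -1 & -1 &  \tfrac{1}{2} &  \tfrac{3}{4} & 1
\end{pmatrix},
\qquad
U = \begin{pmatrix}
1 & -1 & -1 & -1 & -1 & -1 \\
0 &  1 & -1 & -1 & -1 & -1 \\
0 &  0 &  1 & -1 & -1 & -1 \\
0 &  0 &  0 & -2 &  3 &  1 \\
0 &  0 &  0 &  0 & -2 &  3 \\
0 &  0 &  0 &  0 &  0 & \tfrac{1}{4}
\end{pmatrix}
$$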
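
The factorization of the $6 \times 6$ example can be checked with exact rational arithmetic. The short Python sketch below multiplies the two factors and compares the product with $A$. The entries of $L$ and $U$ are those of the example as read from the source (the fractions $\pm 1/2$, $\pm 1/4$ and $3/4$ are an interpretation of a poorly legible layout), and the helper name is hypothetical.

```python
from fractions import Fraction as F

A = [[ 1, -1, -1, -1, -1, -1],
     [-1,  2,  0,  0,  0,  0],
     [-1,  0,  3,  1,  1,  1],
     [-1,  0,  1,  4,  2,  2],
     [-1,  0,  1,  2,  5,  3],
     [-1,  0,  1,  2,  3,  6]]

L = [[ 1,  0,  0, 0,         0,        0],
     [-1,  1,  0, 0,         0,        0],
     [-1, -1,  1, 0,         0,        0],
     [-1, -1, -1, F(-1, 2), F(-1, 4),  1],
     [-1, -1, -1, F(1, 2),  F(-1, 4),  1],
     [-1, -1, -1, F(1, 2),  F(3, 4),   1]]

U = [[1, -1, -1, -1, -1, -1],
     [0,  1, -1, -1, -1, -1],
     [0,  0,  1, -1, -1, -1],
     [0,  0,  0, -2,  3,  1],
     [0,  0,  0,  0, -2,  3],
     [0,  0,  0,  0,  0, F(1, 4)]]


def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


product = matmul(L, U)
# L has nonzero entries above the diagonal, e.g. in position (4, 6).
is_lower_triangular = all(L[i][j] == 0 for i in range(6) for j in range(i + 1, 6))
```

The product equals $A$ exactly, while `is_lower_triangular` is false, which illustrates the remark that $L$ is not necessarily lower triangular.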

5  Computational  experiences


We implemented our algorithms (both sequential and parallel) on a
transputer-based machine. In the computational tests we found that the new method
and the classical method with partial pivoting produce solutions with the same
precision, independently of the type of matrix used. This can be appreciated in
table 1 for random matrices of different sizes. In all cases we have used a vector $b$
corresponding to the exact solution $x_i = 1$ ($i = 1, \ldots, n$), so that we can indicate
the absolute error $\|x - \hat{x}\|_\infty$. Also, the execution times are essentially the same
for both methods, although in the case of our method we are using several
concurrent processes (in this set of experiments we set $R = 10$, i.e., we decomposed
each matrix into $n/10$ blocks of 10 rows each); the transputer hardware handles
the execution of concurrent processes efficiently and the overhead due to this is
very small, as can best be seen for the matrix of size $n = 100$, since
in this case a single processor is running 10 concurrent processes.
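
The accuracy measurements reported in the tables can be reproduced in outline as follows. This Python sketch is an assumption-laden illustration rather than the original test code: it uses a hypothetical solver implementing the classical method with partial pivoting, a seeded pseudo-random matrix with entries in $[-1, 1]$, and a right-hand side built as the row sums of $A$ so that the exact solution is $x_i = 1$.

```python
import random


def solve_partial_pivoting(A, b):
    """Classical Gaussian elimination with partial pivoting."""
    n = len(A)
    aug = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n - 1):
        p = max(range(k, n), key=lambda i: abs(aug[i][k]))
        aug[k], aug[p] = aug[p], aug[k]
        for i in range(k + 1, n):
            m = aug[i][k] / aug[k][k]
            for j in range(k, n + 1):
                aug[i][j] -= m * aug[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def inf_norm(v):
    """Infinity norm of a vector."""
    return max(abs(c) for c in v)


random.seed(0)  # fixed seed so the experiment is repeatable
n = 20
A = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
b = [sum(row) for row in A]  # b = A * (1, ..., 1)

x_hat = solve_partial_pivoting(A, b)
residual = inf_norm([sum(a * xj for a, xj in zip(row, x_hat)) - bi
                     for row, bi in zip(A, b)])  # ||A x_hat - b||_inf
error = inf_norm([xj - 1.0 for xj in x_hat])     # ||x - x_hat||_inf
```

Replacing the solver by an implementation of the method with several pivots per step gives the second row of each pair in such a comparison.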


n            method    ||A x^ - b||_inf   ||x - x^||_inf   rho           run time (sec.)
[illegible]  Ours      1.78E-15           1.64E-14         [illegible]   0.00614
[illegible]  Classic   1.78E-15           1.64E-14         [illegible]   0.00589
[illegible]  Ours      5.33E-15           6.88E-15         [illegible]   0.0346
[illegible]  Classic   3.55E-15           1.24E-14         [illegible]   0.0331
[illegible]  Ours      2.13E-14           2.13E-13         [illegible]   0.398
[illegible]  Classic   [illegible]        [illegible]      [illegible]   0.384
100          Ours      8.53E-14           2.17E-13         [illegible]   2.84
100          Classic   3.55E-14           2.99E-13         [illegible]   2.76

Table 1: results obtained with the two methods on a single processor.


To make it clearer that the numerical precision of the solution does not
vary significantly with the number $nB$ of blocks used, and that the execution
time increases only slightly, we tested our method with a certain matrix of size
$n = 100$, using successively 1, 4, 5, 10 and 20 blocks. The results are listed in
table 2.


nB           ||A x^ - b||_inf   ||x - x^||_inf   run time (sec.)
[illegible]  6.39E-14           6.57E-13         2.77
[illegible]  9.95E-14           1.40E-13         2.77
[illegible]  6.39E-14           1.16E-13         2.78
[illegible]  5.??E-14           1.26E-13         2.78
[illegible]  8.53E-14           2.17E-13         2.84
[illegible]  7.82E-14           1.66E-13         2.89

Table 2: varying the number nB of blocks for a matrix of size n = 100


6  The  parallel  algorithm 

In the development of the parallel application we used a ring topology.
Parallelizing the algorithm consists in assigning a block of $R$ contiguous rows of the
matrix $(A|b)$ to each process of the ring. In this way, a local reduction is carried
out concurrently in each process of the ring. Furthermore, the task of finding
(and broadcasting to the still active processes) the global pivotal row can
proceed concurrently with the local computation. After this, the processes finish
“simultaneously” the reduction step.
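
One simple way to organize the search for the global pivotal row on a ring is to circulate a token that keeps the best local candidate seen so far and then to send the winner around the ring once more. The Python sketch below simulates this scheme sequentially. It is an assumed protocol for illustration only: the paper does not detail how the search is performed, and the function name and the representation of candidates as `(modulus, row index)` pairs are hypothetical.

```python
def ring_global_pivot(candidates):
    """Simulate the selection of the global pivotal row on a ring.

    candidates[p] is (abs(pivot value), global row index) for process p,
    or None if process p is already idle. Returns the winning candidate
    and the number of link traversals used by this scheme.
    """
    P = len(candidates)
    best = None
    for p in range(P):  # the token visits processes 0, 1, ..., P-1 in turn
        cand = candidates[p]
        if cand is not None and (best is None or cand[0] > best[0]):
            best = cand
    # P-1 hops to collect the maximum plus P-1 hops to broadcast the winner.
    hops = 2 * (P - 1)
    return best, hops


winner, hops = ring_global_pivot([(1.0, 0), (3.0, 12), None, (2.5, 31)])
```

In a real implementation each process would forward the token as soon as its local pivot is known and continue with its local reduction, which is what allows the search to overlap with computation.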


6.1  Load  balancing 

The load balancing of the parallel algorithm is not predictable since it is not
possible, in general, to know in advance which process is the owner of the global
pivotal row in each one of the $n - 1$ steps. Because of this, we studied two
extreme cases. The best case occurs when the pivotal row belongs cyclically to
each process (and all processes will be active almost until the end of the reduction
to triangular form). The worst case occurs when the first $R$ global pivotal rows
belong to a particular process (this process will be idle in the remaining $n - R$
steps), the next $R$ belong to another process, and so on. In this respect it is
interesting to note that for randomly generated matrices the load balancing is
always close to the ideal situation (see [18], p. 69-72).
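
The effect of the two extreme cases on the computational load can be explored with a small model. The Python sketch below is a simplification introduced here for illustration, not the analysis of [18]: it ignores communication, assumes $n$ is a multiple of $P$, charges each process a cost proportional to its number of active rows times the number of remaining columns in every step, and takes the slowest process as the parallel cost of the step. All function names are hypothetical.

```python
def model_speedup(n, P, owners):
    """Computation-only speedup for a given sequence of pivot owners.

    owners[k] is the process that owns the global pivotal row of step k.
    """
    R = n // P
    active = [R] * P          # rows not yet used as global pivots
    t_seq = 0
    t_par = 0
    for k in range(n - 1):
        width = n - k         # columns still involved in step k
        t_seq += sum(active) * width
        t_par += max(active) * width
        active[owners[k]] -= 1
    return t_seq / t_par


def best_case_owners(n, P):
    """The pivotal row belongs cyclically to each process."""
    return [k % P for k in range(n - 1)]


def worst_case_owners(n, P):
    """The first R pivotal rows belong to process 0, the next R to 1, ..."""
    R = n // P
    return [k // R for k in range(n - 1)]


n, P = 400, 4
best = model_speedup(n, P, best_case_owners(n, P))
worst = model_speedup(n, P, worst_case_owners(n, P))
```

Under this model, with $n = 400$ and $P = 4$, the best case gives a speedup close to $P$ and the worst case a speedup of about 2.7, i.e. roughly two thirds of the ideal value.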


6.2  Efficiency  and  speedup 

A theoretical study of the speedup $S := T(1)/T(P)$ and efficiency $E := S/P$ of
the parallel algorithm was carried out for the extreme cases described before;
we obtained the following expressions:
best  case: 

/[4n3+3n^(P+2)-n(p2-ep^l,)  ^  P(„-j)+2n  \ 

/  {‘ln^+9n^-7n)p  ‘-  +  l2tf(P  1)  _7„)p  j  Orf-t- 


worst  case: 


2n^(3P^-l)+3n°(4P^-P)-7nP' 

{4n^+9n^-7n)P3 


where: 


'+in  (P+1)-4P 
ln^+9n^-7n)P 


$\theta$ represents the number of flops per second
Old  is  the  start-up  time 

$\theta_d$ is the time required to send a floating-point number through a physical
link.


In any case, the efficiency and speedup increase when $n$ grows and $P$ remains
fixed, and decrease when $P$ grows and $n$ is kept constant. Using the values
ff  =  10^  Qrf  =  2.6^s,  0d  =  4:.5fis  given  in  [13]  for  the  T800  transputer  and 
considering  P  =  4  and  n  =  100, 200, 300, 400, 600,  one  obtains  from  the  previous 
expressions  the  estimated  values  given  in  table  3. 

n      best case       worst case
100    3.14 (78.5%)    [illegible]
200    3.56 (88.9%)    [illegible]
300    3.70 (92.6%)    [illegible]
400    3.78 (94.4%)    [illegible]
600    3.85 (96.3%)    [illegible]

Table 3: estimated values for the speedup and efficiency with 4 processors.


In computational experiments applied to problems of dimension $n = 100$ and
using 4 processors we obtained the values for the speedup and efficiency given
in table 4.


Matrix      T_seq (sec.)   T_par (sec.)   S = T_seq/T_par   E = S/4
Moler       2.39           1.39           1.72              43.0%
Frank       2.44           1.40           1.74              43.6%
Border      3.48           1.72           2.02              50.6%
Dingdong    2.61           1.46           1.79              44.7%
Random      2.78           1.24           2.24              56.0%

Table 4: execution times, speedup and efficiency of the parallel algorithm (with
4 processors).

The  best  results  were  obtained  with  random  matrices,  as  expected,  since  the 
load  balancing  of  the  parallel  algorithm  was  found  to  be  good  for  such  matrices, 
as  mentioned  before. 
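
The speedup and efficiency columns of table 4 follow directly from the measured times through $S = T_{seq}/T_{par}$ and $E = S/P$. The Python sketch below recomputes them from the timings listed in the table; the variable and function names are hypothetical.

```python
P = 4  # number of processors used in the experiments

# (T_seq, T_par) in seconds, as listed in table 4.
timings = {
    "Moler":    (2.39, 1.39),
    "Frank":    (2.44, 1.40),
    "Border":   (3.48, 1.72),
    "Dingdong": (2.61, 1.46),
    "Random":   (2.78, 1.24),
}


def speedup_and_efficiency(t_seq, t_par, processors):
    """Return (S, E) with S = t_seq / t_par and E = S / processors."""
    s = t_seq / t_par
    return s, s / processors


results = {name: speedup_and_efficiency(t_seq, t_par, P)
           for name, (t_seq, t_par) in timings.items()}

for name, (s, e) in results.items():
    print(f"{name:<9} S = {s:.2f}   E = {100 * e:.1f}%")
```

Rounded to the precision used in the table, the recomputed values agree with the published ones.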